Decoding TLS using sslkeylogfile

So you've just finished upgrading your systems to the latest TLS, or maybe you've just converted all of your "legacy" APIs (thrift is so last week) to gRPC, or maybe (my use case) you've got a new CEO who keeps getting unexplained broken pages uncomfortably often, it's put a multi-thousand person engineering org on high alert, and you need to peek at network traffic to figure out which CDN is barfing and why.

Doesn't matter how you got here, you're here because you've decided you want to decrypt some TLS traffic, you own either the client or the server, you have some PCAP files in the middle, and maybe you don't have access to the key. Is this even possible? You might think the answer is no. After all, isn't that what forward secrecy is all about? Preventing you from decrypting intercepted traffic even if you have the private key?

Well, sort of. Except if you have a little bit of help from the client or the server's TLS implementation you can ask your process to dump the keys to the kingdom for use in later decoding.

There is a semi-standardized way of getting TLS secrets out of an app and into wireshark: the NSS Key Log. NSS stands for "Netscape Security Services". Key logging is generally triggered by exporting the SSLKEYLOGFILE environment variable, but some newer implementations (such as boringssl) require explicit setup, so activation can be app specific. For example Chrome has a flag called --ssl-key-log-file and Golang's crypto/tls Config struct has an optional KeyLogWriter field.
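The key log file itself is just plain text with one secret per line. For TLS 1.2 the lines look roughly like this (hex values elided); TLS 1.3 uses labels such as CLIENT_HANDSHAKE_TRAFFIC_SECRET and CLIENT_TRAFFIC_SECRET_0 instead:

# NSS key log format
CLIENT_RANDOM <64 hex chars of client random> <96 hex chars of master secret>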

However you get a key log file, once you have one you can combine it with a pcap file and decode the traffic using wireshark.

Let's walk through an example. You are going to want to run the next 2 commands simultaneously.

Start a pcap:

tcpdump -w out.pcap port 443

Load your favorite TLS enabled website and take a screenshot.

chromium-browser --screenshot --ssl-key-log-file=/tmp/sslkeylog.txt --headless 'https://www.google.com/'

Let's open up this packet dump in wireshark.

[screenshot: wireshark showing nothing but encrypted "application data"]

As you can see, there's nothing but encrypted "application data". Indecipherable until we configure a key log file.

[screenshot: configuring the key log file in wireshark's SSL/TLS preferences]

Now we can view the decoded http(2) content:

[screenshot: decoded http2 content in wireshark]
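If you prefer the command line, tshark can take the key log file as a preference too (the preference is named tls.keylog_file in recent Wireshark releases, ssl.keylog_file in older ones); something like this should show the decrypted h2 frames:

tshark -r out.pcap -o tls.keylog_file:/tmp/sslkeylog.txt -Y http2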

If you are using golang, crypto/tls's KeyLogWriter option is all you need to generate a key-logging client.
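A minimal sketch of such a client (the file path is just an example and error handling is minimal):

package main

import (
	"crypto/tls"
	"net/http"
	"os"
)

func main() {
	// append NSS key log lines to this file as the client handshakes
	f, err := os.OpenFile("/tmp/sslkeylog.txt", os.O_WRONLY|os.O_CREATE|os.O_APPEND, 0600)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	client := &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{KeyLogWriter: f},
		},
	}
	resp, err := client.Get("https://www.google.com/")
	if err != nil {
		panic(err)
	}
	resp.Body.Close()
}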

How to have your cake and h2

This article discusses best practices for deploying h2 on a site that has already been optimized for h1 performance. You may not find this optimization worth your time.

Deploying h2 w/o breaking http performance and older browser performance

HTTP/2 (h2) is ready to be used now and it's pretty widely supported, but there's a problem: in order to maximize h2 performance it sort of seems like you need to de-optimize for http/1.1. Wouldn't it be nice if you could optimize for both?

The good news is you can. Most http/1.1 performance tricks matter less, but they don’t hurt h2 deployments, with one notable exception: domain sharding.

What is domain sharding?

For the uninitiated, domain sharding is an attempt to trick the browser, and it requires a bit of a backstory.

In the early days of the web, user agents were asked to make only 2 connections per server. At that time the only way to know you were done reading a response was to wait for the server to close the connection: 1 request == 1 tcp connection.

This was fixed in later versions of http by keepalive, content-length, and chunked encoding. Essentially the protocol developed a way to indicate when a response had finished and a new request could be sent. This is a huge benefit for browsers because they can skip a lot of work associated with connection creation and teardown. The new rule is that at any given time there's only 1 request in flight per tcp connection, but N request/response pairs can be sent on a TCP connection (we'll skip the whole pipelining thing for now since nobody really ever bothered to implement it).

Unfortunately browsers were now kind of hamstrung if they wanted to download content ASAP. They could only have 2 downloads active at any one time and web pages outgrew this. Servers became beefier, and so did browsers, but everyone who was following the rules was stuck.

Unless they cheated.

A bunch of web developers said “hey, what if I used 10 image domains that all went to the same server, then my browser could download 20 images at once right?”.

They were right.

So domain sharding became popular again, and we as a community decided that it was totes ok to have 20 connections being used when rendering a page, because that makes things fastest.

Fixing the http/1.1 domain sharding problem

Forcing your browser to do DNS lookups/TCP connections/TLS handshakes for a.example.com, b.example.com, c.example.com is a ton of work. I mean, it’s not as expensive as parsing HTML, basically nothing your computer does is as expensive as parsing HTML, but it’s still a lot of wasted latency waiting for network round trips to complete. That robs you of some of the biggest performance wins of h2. Ideally all those domains would be merged into one domain, let’s call it images.example.com, but that would destroy http/1.1 performance.

Luckily, there’s a trick you can use that will let you use your old performance optimized http/1.1 pages in a way that gives you almost all the benefits of a page optimized for h2. It’s called HTTP/2 connection coalescing. Daniel Stenberg from the curl project has written a good explanation of it.

In h2 with TLS (which is basically the only kind of h2 anyone cares about) you have the concept of “authority”. Essentially what this means is that if you have a cert valid for domains {www,a,b,c}.example.com AND all 4 of those DNS records are the same (same ips or same CNAMEs) then h2 connection coalescing kicks in and all your requests can get multiplexed over a single h2 connection. The only overhead introduced is the DNS lookups, but there’s no need for a TCP connect and TLS handshake for your remaining domains.

Why do the DNS records have to match? Essentially it’s to avoid confusion and possible security bugs. If I had the same cert installed on 2 totally different machines I’d be quite surprised if one machine started getting traffic for the other machine out of the blue. Checking the DNS records ensures that the server handling the traffic can handle traffic for the other domains.

Here’s an example to help illustrate. Take the following document at www.example.com which displays 20 products. In order to get around browser connection limits, they have sharded the image domain.

<html>
<head>
<script src="https://a.example.com/app.js"></script>
</head>
<body id="app">
<img src="https://b.example.com/product1.jpg"/>
....
<img src="https://c.example.com/product20.jpg"/>
</body>
</html>

A non-h2 browser might make 3 DNS lookups and open 3 * (max connections per hostname) connections, around 21 to 24, and it will do a TLS handshake for each. This is a pretty heavy page, and it might be the best you can do on regular http/1.1 (non-h2) pages.

In h2-land with a properly configured cert AND DNS (like below) you can reduce that down to 3 DNS lookups and 1 TCP connection and 1 TLS handshake, which will almost certainly be much faster.

SubjectAltNames=DNS:a.example.com,DNS:b.example.com,DNS:c.example.com,DNS:example.com

dig a.example.com
...
a.example.com.    300    IN    CNAME    example.com.
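To check which SANs a live server actually presents, something along these lines works:

openssl s_client -connect a.example.com:443 -servername a.example.com </dev/null 2>/dev/null \
  | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'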

Summary

Hopefully this makes sense. I don’t imagine this is too useful for smaller sites or people who know their customers have nice fast modern browsers. But for bigger companies that want to do some extra work to make sure all their customers get the best experience possible, but don’t want to change their app code, I recommend leveraging this technique of using a wildcard cert or a cert with lots of SAN entries.

p.s. If you like web performance and you want to help 260 million people save time, you should come work with me at Walmartlabs. We do the whole remote thing.

Principal Engineer


In case those links rot, feel free to email me at shanemhansen@whitane.com if you want to talk about whether this group is right for you.

How I test go

How I test

Tests are one of the most wonderful things in software. A good test combines the best pieces of the scientific method (a falsifiable hypothesis) with some of the most awesome features of math (the ability to create a set of axioms and prove something about them). I think of a unit test as an experiment that I can run and re-run in different environments. Designing a good experiment involves being able to recreate certain environments while also being able to reason about properties of the system that should be invariant under certain transformations of environment.

Hopefully I don't stretch this science analogy too far, because I'm having so much fun with it.

As a kid I remember reading Feynman's discussion of experiment design. Let's say you have a grandfather clock ticking away. If you move the clock 10 feet, will it behave the same? Not if there's a wall. If you rotate the clock 90 degrees will it still behave the same? Of course we can see that rotating a grandfather clock 90 degrees will break it.

Proper software testing is much the same. You generally want to identify all your inputs, implicit or explicit, and ensure that they are either constant or that (as in the case of the moving clock) the input doesn't matter. For example if you have a unit test for your bignum library that ensures 1+1=2, you absolutely want to fix the inputs at 1 and 1. You probably don't need to fix the operating system or mock time. Your code should be invariant under those changes.

Begin with the end in mind: why I test

There are a ton of great reasons to write tests. Selfishly, I have 3 big ones. The first is that I make mistakes. Like, a lot. If I haven't written a test for something, it's pretty unlikely that it actually works. Tests allow me to make some assumptions about how my code should act and catch my mistakes quickly.

The 2nd reason is that I view tests as a design tool. I'm forced to use an API as I create it. More importantly, I'm forced to use my API in 2 places. So instead of making whatever change is needed to make the Foo module of the Bar app work, I'm forced to think about the Foo module by itself and use that module in both the Bar app and a unit test. I firmly believe that if you can use an API in 2 places you've done 80% of the work to make it easily re-usable.

The 3rd reason is a little more subtle. Whenever I encounter a bug in the wild I've found the ideal workflow for me goes:

  1. Understand what is expected vs what is happening
  2. Make bug reproducible (i.e. go to this page, click this button, verify that I saw an error). Usually involves curl or a browser.
  3. Reproduce locally
  4. Encode bug as a failing unit test.
  5. Fix the bug.
  6. Verify the test now passes.
  7. Verify bug is fixed locally.
  8. Push to dev/qa/prod and verify bug is fixed.

If I didn't have existing unit tests step 4 would take a lot longer. I'm not saying this is my process all of the time, sometimes the best I can do is to get to step 2 and reproduce locally. However I've fixed P1 outages of multi-billion dollar systems this way. You might think you don't have time to make a test, but if that's true you really don't have time to deploy an incorrect fix. Also, pro tip: most of the work is in steps 1 and 2.

So today I'm mainly going to be talking about unit tests large and small since they are the most important to me. I say large because I might have a unit test for a small function call, but also have a unit test that involves sending mock http requests through a http handler configured with a dummy API backend. Maybe that's an integration test. I don't know.

Signs of test debt

Test debt is a lot like technical debt. It may exist in your codebase and it's something to be aware of and fix. Test debt results in tests that are hard to modify or flaky. I've seen or been responsible for all of these.

  1. Your machine has to have a magic config file placed in /etc/ before tests will run.
  2. You can't run tests in parallel
  3. Your tests depend on each other (one test sets up state that another test uses)
  4. Your tests have side effects (inserting a row into a system)
  5. Your tests depend on the outside world (this is contentious, I understand the need for a SQL database to test some apps, but I don't like it)

Essentially global mutable state should be avoided and your filesystem and network are global mutable state.
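One concrete way to keep the filesystem out of your tests is to accept an io.Reader instead of a filename. A minimal sketch (imports of bufio, io, strings, and testing omitted):

func CountLines(r io.Reader) (int, error) {
	// the function doesn't know or care whether r is a file, a socket, or a string
	scanner := bufio.NewScanner(r)
	n := 0
	for scanner.Scan() {
		n++
	}
	return n, scanner.Err()
}

func TestCountLines(t *testing.T) {
	got, err := CountLines(strings.NewReader("a\nb\nc\n"))
	if err != nil || got != 3 {
		t.Fatalf("CountLines = %d, %v; want 3, nil", got, err)
	}
}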

The anatomy of a good test

A good test has explicit inputs, deterministic processing, and well defined assumptions about outputs. Depending on the thing being tested, the outputs might be very strictly controlled, or you might be testing invariants. Both are ok.

Example: Let's say I have a function that computes the net worth of a person. A minimal sketch of what that function and its test might look like (the names here are hypothetical):

func NetWorth(assets, debts int) int {
	return assets - debts
}

func TestNetWorth(t *testing.T) {
	if got := NetWorth(100, 40); got != 60 {
		t.Errorf("NetWorth(100, 40) = %d, want 60", got)
	}
}

how to countish

Introducing countish: a library for approximate frequency counting

tl;dr see the countish repo at github.com/shanemhansen/countish

Background

So every once in a while, I get asked a simple question that leaves me scratching my head.

In this case, 2 years ago someone asked me: “How can I find the most popular urls?”. The naive algorithm is pretty simple: keep a map of counters and when you see a url, increment the appropriate counter.

counts := make(map[string]int)
for _, url := range urls {
   counts[url]++
}

But this algorithm isn’t completely satisfactory. For one thing it uses O(n) memory, where n is the number of distinct urls. For low cardinality sets, this is a great algorithm, but if you’re counting something that may contain a ton of different values (such as urls) you could end up using a ton of memory. It seems pretty obvious that if you want to count things you have to store them, and if you want to precisely count N distinct things you have to have at least N counters. I was aware of bloom filters and count-min sketches. These datastructures allow for very space-efficient methods of estimating counts, but they don’t retain information about the original keys. They allow you to ask questions like: “How many hits did the home page get?” but not: “List all pages that exceeded 1% of traffic”.

Fast forward a couple years into the future, and I’m running a service in production which happens to handle a big firehose of data. High cardinality data. People keep asking me questions about heavy hitters and realtime popularity info. I know how to answer this question with a bunch of machines running map-reduce, but I’d like to be able to answer this question in realtime without worrying about memory exhaustion. My servers have more important uses for their RAM.

Thanks to the magic of google, I recently discovered a fantastic paper, Approximate Frequency Counts over Data Streams. In this paper they discuss some methods of approximate frequency counting, which seem to offer bounded memory for distributions likely to be seen in the real world and logarithmic worst case space complexity.

I can work with logarithmic. Logarithmic means that I might use another Xmb of ram if my server runs for another month and another Xmb of ram if the uptime is a year.

There are 2 algorithms discussed: sticky sampling and lossy counting. I decided to implement these in Go and see how well they perform. (Disclaimer: my implementation is very unoptimized. the original paper mentions a trie and a mmap’d buffer)

If you want explanations and visualizations of how these algorithms work, check out Michael Vogiatzis’s excellent post on frequency counting algorithms over data streams. The rest of this post is going to be using my new countish library and examining the results in terms of performance, accuracy, and memory usage.
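To give a flavor of how lossy counting works, here's a stripped-down sketch of the algorithm (this is not the countish API, just an illustration; the math import is omitted). To answer a query, you report every key whose count is at least (threshold - epsilon) * n.

// LossyCounter is a sketch of lossy counting (Manku & Motwani), not countish's API.
type LossyCounter struct {
	epsilon float64        // allowed error, e.g. 0.001
	width   int            // bucket width = ceil(1/epsilon)
	n       int            // total items seen
	bucket  int            // current bucket id
	counts  map[string]int // observed count per surviving key
	deltas  map[string]int // maximum possible undercount per surviving key
}

func NewLossyCounter(epsilon float64) *LossyCounter {
	return &LossyCounter{
		epsilon: epsilon,
		width:   int(math.Ceil(1 / epsilon)),
		bucket:  1,
		counts:  map[string]int{},
		deltas:  map[string]int{},
	}
}

func (l *LossyCounter) Observe(key string) {
	l.n++
	if _, ok := l.counts[key]; !ok {
		l.deltas[key] = l.bucket - 1
	}
	l.counts[key]++
	if l.n%l.width == 0 {
		// bucket boundary: prune keys that can't possibly be frequent
		for k, c := range l.counts {
			if c+l.deltas[k] <= l.bucket {
				delete(l.counts, k)
				delete(l.deltas, k)
			}
		}
		l.bucket++
	}
}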

Show me the money^H^H^H^H^H performance!

I’ve implemented these algorithms in go in the countish repo. Let’s see how they perform. First we are going to need some data. Of course searching for good test data ended up delaying this blog post by a couple hours as I scoured the internet for the infamous “star wars kid” video apache logs.

After failing at that, I finally stumbled onto a random log sharing site with a nice sample of web server logs. This isn’t what I’d call “big data” but it’s a useful dataset. I’m going to examine some logs from a bluecoat proxy.

test -f bluecoat_proxy_big.zip || wget http://log-sharing.dreamhosters.com/bluecoat_proxy_big.zip
if ! test -f Demo_log_001.log ; then 
    unzip bluecoat_proxy_big.zip
fi
head -5 Demo*log | grep GET | awk '{print $11$12}'

www.yahoo.com/
www.inmobus.com/wcm/assets/images/imagefileicon.gif
images.netmechanic.com/images/webtools/webmaster_tools.gif
www.ositis.com/tests/testconnectivity.asp

Let’s find the top requests made. This dataset is unusual in that no url makes up more than about 0.75% of traffic, so let’s find every url that exceeds 0.5% of traffic (a threshold of .005) using an exact method. We’ll use the exact “naive” countish implementation.

go get github.com/shanemhansen/countish/cmd/countish
cat Demo*log | grep --text -i GET | awk '{print $11$12}' | /usr/bin/time countish -impl naive -threshold .005 2>&1

0.007344 energydata.aws.com/WxDataISAPI/WxDataISAPI.dll
0.006139 vm.boldcenter.com/aid/5707504118312057803/bc.vm
0.006692 rad.msn.com/ADSAdClient31.dll
0.006537 energydata.aws.com/ForecastISAPI/ForecastISAPI.dll
3.87user 0.55system 0:08.09elapsed 54%CPU (0avgtext+0avgdata 253656maxresident)k
0inputs+0outputs (0major+82827minor)pagefaults 0swaps

Looks like the total memory usage is about 261 megabytes and the runtime is 3.5s on my machine. Let’s compare those results to a lossy implementation, using a .001% error tolerance.

cat Demo*log | grep --text -i GET | awk '{print $11$12}' | /usr/bin/time countish -error-tolerance .001 -impl sticky -threshold .005 2>&1  | awk '$1>.005 {print $0}'

0.006631 energydata.aws.com/ForecastISAPI/ForecastISAPI.dll
0.006237 vm.boldcenter.com/aid/5707504118312057803/bc.vm
0.006792 rad.msn.com/ADSAdClient31.dll
0.007440 energydata.aws.com/WxDataISAPI/WxDataISAPI.dll
3.03user 0.26system 0:07.56elapsed 43%CPU (0avgtext+0avgdata 12256maxresident)k
0inputs+0outputs (0major+3308minor)pagefaults 0swaps

First thing we notice is that the runtime is about 20% faster. The memory usage is 12mb, a 20x reduction! The results take a little interpretation. You’ll see that the energydata.aws.com, rad.msn.com and vm.boldcenter.com urls are all included. You’ll also notice that I am post-processing the output to extract values whose estimated frequency is > .005 (the awk filter).

Let’s compare to lossy counting. Note: I’m post processing results that have an estimated frequency > .005.

cat Demo*log | grep --text -i GET | awk '{print $11$12}' | /usr/bin/time countish  -impl lossy -threshold .005 2>&1  | awk '$1>.005 {print $0}'

0.007444 energydata.aws.com/WxDataISAPI/WxDataISAPI.dll
0.006792 rad.msn.com/ADSAdClient31.dll
0.006239 vm.boldcenter.com/aid/5707504118312057803/bc.vm
0.006637 energydata.aws.com/ForecastISAPI/ForecastISAPI.dll
3.62user 0.25system 0:07.60elapsed 51%CPU (0avgtext+0avgdata 7600maxresident)k
0inputs+0outputs (0major+1982minor)pagefaults 0swaps

Lossy counting has a slightly larger runtime, but offers roughly a 40% memory reduction compared to sticky sampling and over a 30x memory reduction from the exact method, while still returning very accurate results!

Conclusions

Empirically, approximate counting seems to be a huge win. Sticky sampling offers greatly reduced memory usage and increased performance on high cardinality sets without measurably degrading results. This sort of algorithm is ideal for building a google analytics realtime experience or integrating into your favorite multitenant stream processor to report on failing urls.

Post Traumatic Stress Driven Design

I’ve struggled for a while to come up with a name for something I’ve seen my coworkers doing over the last few years. PTSDD is characterized by the seemingly irrational avoidance of certain technologies based on experience that is not applicable. Here’s how one usually ends up with PTSDD:

  • Step 1: Choose a shiny technology (foobar 0.9)
  • Step 2: Use it in the stupidest possible way
  • Step 3: Many late nights
  • Step 4: Attempt to optimize stupid design because it’s too late to change it
  • Step 5: Despair. Time passes
  • Step 6: Go to new startup
  • Step 7: Someone brings up using foobar (now at 2.0) for a totally different use case
  • Step 8: Flashbacks begin. Squash idea at all costs
  • Step 9: Choose a shiny technology, with none of the problems of foobar
  • Step 10: goto 2

Now the first thing to realize is this blog post is not about engaging in victim blaming. Some poor sod has lost months of his or her life to using beta software and poor architectural decisions or decisions that made sense on a small scale but didn’t scale with the company. It’s only natural to use our experiences to make better decisions in the future. One of the things that makes a senior engineer more useful than a junior engineer is their ability to bring their experience to bear on a problem. Unfortunately this can go too far, and the desire to avoid the pain of dealing with a previous poor decision ends up short circuiting the engineering analysis that one should undertake when considering a software component in a new system.

Case Study 1: Relational Databases Considered Harmful

Years ago, I started a job as the first engineer for a company making a distributed filesystem for consumers. Lucky for me, the cofounders had lots of experience with distributed data at a previous startup that did backups. One of my first jobs was to set up a basic user system. The idea at the time was that people would sign up and pay us. I proposed we use a transactional database (it doesn’t matter which one). When you’re dealing with things like user accounts and billing I find transactions to be a nice feature. Transactional databases do come with drawbacks however: they aren’t always the fastest things around. They aren’t really the fastest at reading or writing data. They are awesome at being able to run really crazy queries though. They also generally have good tooling since they’ve been around for a while.

My idea was instantly dismissed. As everyone knows, a relational database doesn’t scale, and it’s not distributed. If you’re building a distributed filesystem then the word of the day every day is distributed. “You know what’s distributed? Cassandra. In fact one of our best buddies works for a Cassandra company. Use it so that we can scale”.

This perplexed me. While I certainly understood that in the abstract Cassandra is better suited for handling a high write volume, and has read replication builtin, I failed to see how this was the right choice for logging in our users. After all, there are only about 3 billion internet users in the world. That’s not a particularly big dataset. Perhaps if all 3 billion of them wanted to become customers on the same day we’d have a problem. With a relational database we might have to wait a couple days to get all the world’s users signed up. I judged the odds of every internet user on earth signing up at once to be relatively low. I dutifully began building a prototype that used Cassandra, but at the same time I wanted to discover a little more about why relational databases were to be avoided at all costs.

As I probed more, a picture began to emerge. At their previous startup the cofounders used a relational database to store metadata for every single file on a user’s computer and updated that database every time the file was updated anywhere. Needless to say this was a painful experience. However I think it’s safe to say that using a relational database as a filesystem metadata store for all your users is a very different use case than using a relational database for login.

After a little back and forth we did end up using a relational database. I hesitate to give out specific numbers on customers, but I can safely say that after a year and a half of being in business it didn’t appear that we would need Cassandra in order to keep up with new signups.

So what broke down in the analysis? The use case we were considering for a relational database was fundamentally different (one write per customer signup vs one write per file access) when compared to the use case that had caused so much pain. However the old wounds were still there and it took some convincing before the cofounders believed that “this time it will be different”. Luckily, they trusted in the people they had hired.

Case Study 2: JSON Considered Harmful

I’ll pick on myself this time. A while back I worked for an ad-tech company that needed to convert an internal format from delimited to something more extensible. JSON was an obvious choice. It’s readable just about everywhere. Nobody’s nervous about being able to read gzipped json in 10 years. It’s extensible because fields can be added without breaking old clients. So I happily built a system that accumulated a few gigabytes per hour of raw json data.

The first sign that something was wrong was when we started to benchmark tools that worked on the previous delimited text format against tools that worked with JSON. It wasn’t uncommon to see a 10x slowdown. Conservatively, that’s a 5 figure a month bill just for serialization/deserialization costs. That’s not counting the increased storage costs. Certainly we’d be willing to pay some price for more flexibility, but that seemed pretty steep. As a lowly engineer, I wasn’t footing the bill and nobody was complaining about the price, so we moved ahead with JSON. The real cost came later. We had something like a log oriented architecture. It wasn’t too unusual for us to deal with production issues by replaying the log. Unfortunately with JSON our ability to replay the log as a means of dealing with issues was drastically reduced. It took too long to replay an hour of data, and fundamentally JSON parsing was the bottleneck. As an engineer it just irked me that we could be running 10x as fast if we had just stored the bytes a little differently. As someone who occasionally had to stay late to reprocess some time interval, you bet your ass I’d have loved to be able to go home 10x earlier some nights. There’s a whole new agility you gain when you can reduce your problem size down to something you can fit on a single box.

After this experience, I took a pretty dim view of JSON anywhere but the browser. Even there I still think a well written asm.js encoder using typed arrays and a nice efficient encoder/decoder would probably beat JSON.

Later, I wrote a system that shipped some log data around. My first instinct was to use anything but JSON for serialization. I didn’t want to deal with apps that were 10x slower than they needed to be. However I was wrong. JSON was a good choice. My gut feeling hadn’t kept up with the changing use case. Processing several thousand JSON documents per second is not the same as parsing a dozen per second. There was also no real need for this system to have any ability to replay logs at a high speed. No additional hardware was needed to run the system. We still got the benefit of being able to look at the data using tools like jq, and having a simple text format made debugging much easier. I didn’t need special tools to construct a test payload, like I would need with protobufs.

So here’s my more nuanced opinion: If you need to do high throughput messaging or rpc on your backend, consider using a binary serialization format. You can get a pretty big gain in performance. If you’re interacting with 3rd parties and/or your volume of messages is relatively low then JSON’s a great choice. Make the right decision for your use case. It’s rarely as simple as JSON=bad or JSON=good.

Conclusion

I could say something trite like: “use the right tool for the job”, but that doesn’t actually help you at all. As you architect systems be aware that most technical decisions are made on the basis of emotion and unrigorous statistics (99% of made up statistics agree). Try and understand when you are making a decision based off emotion. It’s not always a bad thing, heuristics formed by experiences aren’t always wrong, and we have a lack of actual rigorous info about how to engineer software, but if you think you are making an argument based on merit and it’s really based on emotion, you’re going to have problems down the road.

I’ll leave you with some tips, adapted from the intelligence analysis literature, for how to keep yourself honest and make sure you’re not practicing PTSDD.

Pay attention to:

  • perceptual bias: people perceive what they expect to perceive (Ex: query runs slow? Confirms bias that databases don’t scale).
  • bias in estimates: probability estimates are influenced by how easily one can imagine an event or recall similar instances (Ex: I saw this break once, therefore it’s quite likely to break in my experience).

Before making a decision:

  • Do a key assumptions check. Have any of your assumptions changed? (Ex: is the expected workload similar to one I had trouble with before?)
  • Analysis of Competing Hypotheses: Play Devil’s Advocate. Try and convince yourself that you’re wrong.

bug hunting in the real world

Finding the root cause of bugs in the real world is hard. It can be full of false starts, and sometimes even after you understand how all the moving parts fit together, you are still left wondering whose fault (if anyone’s) the bug is. Bugs that are the product of emergent behaviour are the hardest to quantify.

I’ve run into two of these in the last few months. Both bugs span multiple open source projects and involved lots of blood, sweat, and tears to finally determine the root cause of.

Bug #1 excessive salt-minion cpu usage in docker

I recently made a few small changes to my team’s salt repo. Since I’m a semi-responsible developer I want to test these changes before unleashing them on the rest of the team. I fired up docker (which has the nice advantage of being able to cache intermediate stages of a build) and installed salt-minion.

Things seemed to work just fine. Salt was installed, functioning, no weird errors were seen in the logs. I had to work around the fact that iptables doesn’t seem to work quite the same way within a container, but no huge issues.

Then I noticed my laptop’s fan wouldn’t shut off. I usually blame my corporate outlook since it has a habit of pinning a single core for absolutely no reason, but in this case outlook was blameless. Running top showed me that salt-minion was using all the cpus on my computer. After several minutes cpu usage settled down.

The first tool I look for in cases like this is strace. Unfortunately strace barfed with a permission denied error. I tried starting a docker container as the root user, still got a permission denied error (Note: a colleague later told me I could try and run the container in something called privileged mode). Gdb gave me similar errors. Eventually I realized that this was related to Apparmor on ubuntu (don’t ask me how I figured it out, iirc I was googling pretty hard-core at that point). Apparently the apparmor profile shipped with the docker package wisely considers calling ptrace() on containerized processes ill-advised. I ended up calling some combination of apparmor unload and monkeying with some files until strace worked. I could now strace a process in the container. Step one accomplished.

Using strace can be a little tricky when working with a daemon using worker processes. Usually I end up writing something like this: (paraphrasing)

# Generate something like strace -fp 10 -fp 2
strace -s1024 $(pgrep salt | awk '{print "-fp",$1}' | xargs)

The first result shows some pretty odd behaviour:

close(one bajillion) = EBADF Bad file descriptor
close(one bajillion and one) = EBADF Bad file descriptor
close(one bajillion and two) = EBADF Bad file descriptor

One wonders why python feels the need to close thousands of non-existent file descriptors. If we answer that question, maybe we’ll understand why salt runs so slowly in docker. I used some old-school printf debugging and some liberal editing of system python library files (that’s how I roll), and discovered that the closes were all triggered by some salt code that was attempting to find the available network interfaces.

(The offending salt code shells out via subprocess.Popen to gather that information.)

It was possible that the keyword argument close_fds might have something to do with the actual closing of invalid fds. However it’s odd that calling subprocess.Popen (a python stdlib call) would perform this poorly. Python’s often used as a glue language and it’s no stranger to spawning processes. Calling subprocess.Popen(close_fds=True) didn’t seem slow anywhere other than docker. I now had my test case!

python -c "import subprocess;subprocess.Popen(['/bin/ls'], close_fds=True).wait()"

I’d accomplished something important. I’d created a small, self-contained program to trigger the bug. I’d also shown that the bug didn’t really have anything to do with salt. The bug went from being a salt/ubuntu/docker bug to being a python stdlib/docker bug. With that in mind, let’s find out more about how the close_fds flag works. close_fds is a security feature that allows child processes to not have access to the filehandles of their parent process. close_fds is implemented differently in different python versions. In many versions of python close_fds is implemented by determining the max possible fd and then calling close() on every fd from 1 to MAX. Computers are pretty fast, so max would have to be really high to cause subprocess.Popen to be that broken and even then it shouldn’t take that long. The C code should look something like:

int i;
int max = getmax();
for (i=0;i<max;i++) {
    close(i);
}

However since this is python and the closing code is in pure python it actually looks something like this:

for i in range(0,getmax()):
    try:
       os.close(i)
    except:
       pass #ignore whatever exception was just constructed and thrown

So instead of making getmax() syscalls it makes getmax() syscalls and also constructs, throws, and catches getmax() exception objects, complete with stack traces. This is what is so cpu intensive. So the final piece of the puzzle is to determine if getmax() (the number of allowed open files) is exceptionally high in salt. In linux the default number of open files for a user process is usually 1024. Applications such as Cassandra or http load generating apps will ask for more, but it’s relatively rare. It turns out that the docker container runs with a default of 500k!
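You can see the limit a container actually inherits with something like:

# fd limit as seen from inside a container
docker run --rm ubuntu bash -c 'ulimit -n'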

So now we know exactly why salt performs so slowly in docker on ubuntu:

  • Ubuntu’s docker package starts docker with a user file handle limit of 500k
  • Containers spawned by docker inherit that high limit
  • Python daemons read that limit when spawning a process with close_fds=True
  • Slowness ensues

Whose fault is this? Is it python’s fault for having a stdlib call that can perform so pathologically? Is it docker’s fault for running their daemon with such high filehandle limits? Is it salt’s fault for running slow?

The blame seems to rest on docker and python. Docker has plans to implement a ulimit api (issue 9876). Python itself has learned some new tricks: on some platforms the /proc filesystem can be used to find the max fd, and in newer versions of python process spawning goes through c code (posix fork_exec, iirc).

Bug #2 random data corruption in go

This bug was reported to me by a colleague who was stumped. He was encountering some sort of ‘spooky action at a distance’ where a call to some function was corrupting data that had nothing to do with that function. I thought it wouldn’t take long to track down the bug. Obviously we were seeing some unintentional sharing of data in a go []byte slice (which is mutable). The code that triggered the error looked something like this (error checking skipped for brevity):

type FooRow struct {
    ID int
    Payload json.RawMessage
}
original := new(FooRow)
result := db.QueryRow("...", 1)
result.Scan(&original.ID, &original.Payload)
var newRow FooRow
json.NewDecoder(req.Body).Decode(&newRow)
db.Exec("some update statement", newRow.ID, []byte(newRow.Payload))

It seemed that without fail original.Payload was correct prior to the call to Exec, and after Exec it was corrupt. In our case it looked like the original JSON had been replaced with similar JSON missing the first few characters.

My first instinct is to create a simple test that reproduces the problem. In our case the code was being called in the context of a http handler. I wrote a simple main function and supplied my own http request and response using http.NewRequest() and httptest.NewRecorder() and recreated the problem. Since I suspected this was the result of unintentional data sharing I traced where the corrupt memory region was allocated, who it was shared with, etc. Since this is go, I built and ran my code with the race detector. It’s quite good at finding unintentional data sharing among goroutines. My code unfortunately had no races. The original data structure was allocated using new() so it started with a nil Payload. The payload itself was allocated when calling Rows.Scan. That seemed like a dead end unless there was a driver bug. In no code path was original.Payload touched.

However I knew for a fact that data was being shared. As part of my process to determine how the corruption occurred I printed out a pointer to the first element of the corrupt data. Go slices have a data pointer, length, and capacity and I needed to verify that the data pointer was not being changed. Only the data pointed to was being mutated. I knew that somewhere in the code there was a big byte slice whose data region (defined by the data pointer and the data pointer + length) must overlap with original.Payload.

So I started looking into the database/sql code and the mysql driver. I also started doing some printf debugging to determine where exactly within Exec the data was overwritten. I had to do something really hacky here which was to create a package level []byte variable in the mysql package and assign my corrupt/soon-to-be-corrupt memory to it. After much debugging I determined that the corruption occurred in the driver's writeExecutePacket():

writeExecutePacket() {
...
    paramValues = append(paramValues, v...)
...
}

This adds up. The Exec function eventually results in writeExecutePacket() being called. The code I’m looking at has something to do with writing bind parameters to the wire. We now have two dots to connect. How does original.Payload get connected to the outgoing buffer in the mysql driver?

It turns out that the go mysql driver has a “zero copy-ish” buffer that it uses to send and receive data. When data is received, original.Payload is basically a pointer into the receive buffer. When the driver package prepares to write data that buffer is re-used. So now we understand the mechanics of how data is being shared and corrupted.

The question remains why? Scanning data from a database into a byte slice is a relatively common op in go. I’ve never had to worry about the data scanned into one of my structs getting overwritten before. What’s different now? Have I uncovered a bug in the mysql driver? Has all the database go code I’ve written up until now been wrong and I just didn’t know it? I must search the documentation for answers.

The database/sql package talks a little bit about shared []byte slices in the documentation for RawBytes. It turns out that when you scan into a []byte you get a copy of the data. If you choose to pass Scan a RawBytes pointer you don’t get a copy. It sure sounds like our data is being treated like RawBytes and not []byte. This seems odd because I’ve never even heard of RawBytes before. I know nobody in our codebase is using it. However I am not passing a []byte to Scan(); if you read the code above carefully you’ll see that Payload is actually a json.RawMessage. Rows.Scan doesn’t document what happens when you pass a json.RawMessage in, and some more println debugging showed that changing the type of Payload from json.RawMessage to []byte solved the problem!

It turns out that apparently json.RawMessage buffers are also shared/zero-copy. This behaviour is not explicit but falls out of the fact that the database/sql package only copies []byte slices. Passing any other alias of []byte is probably UB and ends up being treated as a RawBytes value. Problem solved!
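A hedged sketch of the safe pattern we ended up with (row here is whatever your query returned):

// scan into a plain []byte so database/sql hands us our own copy,
// then convert to json.RawMessage afterwards
var payload []byte
if err := row.Scan(&original.ID, &payload); err != nil {
	// handle err
}
original.Payload = json.RawMessage(payload)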

I tried to kick off a discussion about this upstream. Imo it’s odd that if I write:

type MyType []byte

Then MyType is automatically opted in to the database/sql zero-copy features.

So let’s recap how this bug occurred:

  • Someone fetched json from the database and used a json.RawMessage field
  • json.RawMessage was only valid for that current database op
  • A 2nd database op occurred
  • json.RawMessage’s contents had been replaced with a random piece of the driver’s send buffer
  • Corruption ensues

Reflections

Sometimes it seems like I must use software in strange ways. I seem to uncover bugs that other people must have run into before. I wonder why I find so many obscure bugs now. I think it’s that as a junior engineer I ran into similar bugs but lacked the skill to find the root cause. I usually ended up just finding a workaround for the general category of failure. As my skills have increased I can now identify things that aren’t working and chase the bugs all the way from application weirdness through the stdlib, strace, apparmor and incorrect or misleading documentation. This isn’t always an efficient use of time. And I can see why some might call it yak shaving. But to me it’s important that the foundational libraries I build my software on work and work in either expected or documented ways. It’s also important for me to be able to deliver software that works end-to-end rather than just pointing to some area I lack knowledge in and saying “the bug must be there”. We’ve all met those people who are constantly pointing to their broken software and saying things like: “caching issue”, “compiler bug”, “kernel bug” without any data to back that up. Don’t be that engineer.

recovering deleted files in osx

Here’s how I stupidly deleted some important data and earned my honorary 1337 hacker badge by recovering it.

The setup

I use a macbook pro running osx, but I do my development and communication on linux vms. Recently I was deleting some old files on my mac (because for some reason, despite the fact I have a terabyte SSD, using 250GB of data seems wasteful). I came upon a 30GB virtualbox disk image that clearly wasn’t needed anymore. The file was called CentOS.vdi and I was certain I didn’t need any centos vms. I attempted to delete the file via finder but osx wouldn’t allow me because the file was in use. “Luckily” I was able to use rm from the command line and the file was instantly gone. Problem solved.

Around this time I started looking to see how big the drive file was for my “main” linux mint vm. Couldn’t find it, but strangely the vm continued to run without any problems. It was at this moment that I realized my dev vm disk file was gone. When I had created my linux mint vm I had re-used an older centos disk file.

Luckily for me, my vm wasn’t crashing. Not being a complete *nix n00b I realized that it was quite likely that virtualbox was maintaining an open file descriptor to the now unlinked file. The inode formerly known as CentOS.vdi was no longer in the filesystem namespace but it still existed as long as the vm was running. The only question was how to recover it?

I come from a linux background so I instantly thought of the proc filesystem. Unlinked files show up in there as broken symlinks. A little known fact is that you can still open() those symlinks and get to the original file. In linux it would be very straightforward to recover the file with a shell script that looks like:

cp /proc/$pid/fd/$n ~/somewhere-safe

However OSX lacks the proc filesystem so I’m out of luck there.

Using OSX debugging tools

What can I do on OSX? I can use lsof to determine which file descriptor in the process refers to my deleted file. I can use gdb (or lldb) to freeze and alter the process. In the past I’ve used tools which use gdb under the hood to make python programs execute arbitrary injected code (python programs generally have a symbol called something like PyRun_SimpleString). However virtualbox is a c/c++ program which makes injecting code more complicated than simply calling a function with a string. However I now at least have a plan of attack:

  1. attach to the process using lldb -p
  2. open() a new writable file using call open("filename", flags)
  3. Take the fd referring to my deleted file and seek to the beginning
  4. ??? copy from one fd to the other with something like cat or a bit of python

Preparing the process

The first few steps look something like this:

lsof -p $pid # to find the fd we want to undelete, in my case 22
lldb -p $pid
call open("/tmp/hail-mary", 0x0002)
call lseek(22, 0, 0)

Getting data out of the process

sendfile v1

Finding code already in the process to copy data from one fd to another was quite tricky. My first attempt was to use sendfile() , which actually copies data from one fd to another. All I needed to know was the size of the file, which I could determine by calling lseek(fd, 0, SEEK_END) with the appropriate arguments. Unfortunately I hit a snag here because sendfile only works if the outgoing fd is a socket. sendfile returns ENOTSOCK.

sendfile v2

I briefly tried to use the socket apis plus sendfile to ship the data to a local netcat process, but my ability to call functions in lldb is pretty limited. I can call functions that take string or integer arguments, but I would have needed to look up the values of several #defined constants like AF_INET (easy) and properly allocate and initialize a c structure called sockaddr_in (seemed harder). I had now ruled out netcat and sendfile.

dlopen

At this point I was out of ideas. Then inspiration struck: I merely needed to write a simple function that did the job I needed and load it as a shared library using dlopen . The function I linked in needed to be easily callable from within lldb, so I hard coded in the arguments. I ended up using the below code.

#include <stdio.h>  /* printf */
#include <unistd.h> /* read, write */

int shane_copy() {
  int output = 23; /* from running: call open("/tmp/hail-mary", O_RDWR) */
  int input = 22;  /* found using lsof */
  char buf[4096];
  int n;
  int copied = 0;
  while(1) {
    n=read(input, &buf, 4096);
    if (n==0) {
      return copied;
    }
    if (n<0) {
      return n;
    }
    copied +=n;
    n=write(output, &buf, n);
  }
}

We build the above statement into a shared library like so:

clang -g -shared  -o libfoo.so foo.c

Let’s load our new library using lldb. Luckily we don’t have to use dlsym and casting; in my version of lldb shane_copy() is directly callable. (dlopen’s second argument, 2, is RTLD_NOW.)

print (void*) dlopen("/path/to/libfoo.so", 2)
call shane_copy()

shane_copy() ran for a few minutes and voilà, my virtual machine’s disk had been saved!

Epilogue

Now that the heat of the moment has passed I’ve thought of a few other easier ways to save my deleted file.

  1. dd from the vm
  2. dlopen python and use PyRun_SimpleString, as mentioned above

Also, it would have been a great idea to have more frequent backups and to make sure my vm only has volatile state.

oo in go

Subclassing in Go

Quick summary: This article shows some techniques for mimicking behaviour of “oo” languages like java or python in go.

Golang does not offer language features for doing object oriented programming in the style of java or python (or javascript or Common Lisp or whatever). The Go FAQ states:

Although Go has types and methods and allows an object-oriented style of programming, there is no type hierarchy. The concept of “interface” in Go provides a different approach that we believe is easy to use and in some ways more general. There are also ways to embed types in other types to provide something analogous—but not identical—to subclassing.

When using go, I rarely miss subclassing. Composition is usually the right design tool. However, every once in a while when porting python code I want to emulate the dispatch used in normal OO code. Here’s a concrete example:

#!/usr/bin/env python
class Parent(object):
      def flush(self):
          print "parent flush"

      def close(self):
          self.flush()
          print "close"

class Child(Parent):
      def flush(self):
          Parent.flush(self)
          print "child flush"

x = Child()
x.close() # prints "parent flush"\n"child flush"\n"close"

The important thing to notice is that

  • The parent object’s close() method is called.
  • The parent calls the child’s flush method.

This is actually really interesting, and implies that the definition of flush() must be looked up dynamically.

Let’s examine how to get equivalent functionality out of go. First we’ll try embedding.

package main

import "fmt"
type Parent struct {}
func (p *Parent) flush() {
    fmt.Println("parent flush")
}
func (p *Parent) close() {
    p.flush()
    fmt.Println("close")
}
type Child struct {
    Parent
}
func (c *Child) flush() {
    c.Parent.flush()
    fmt.Println("child flush")
}

func main() {
    x := new(Child)
    x.close() / "parent flush"\n"close"
}

That doesn’t seem to work. The child’s flush() method is never called. Why? Because embedding isn’t subclassing. When Parent.close() is defined, the call to p.flush() is resolved at compile time to a call to Parent.flush(). It just doesn’t know or care about the existence of Child.flush(). What we need is for Parent to resolve the call to flush() at runtime, and to do that we’ll need Parent to have a reference to an interface, like so.

package main
import "fmt"
type Person interface {
    close() // define an interface to look up methods on
    flush()
}
type Parent struct {
    self Person // retain a reference to self for dynamic dispatch
}
func (p *Parent) flush() {
    fmt.Println("parent flush")
}
func (p *Parent) close() {
    p.self.flush() // call the flush method of whatever the child is.
    fmt.Println("close")
}
type Child struct {
    Parent
}
func (c *Child) flush() {
    c.Parent.flush()
    fmt.Println("child flush")
}

func main() {
    x := new(Child)
    x.self = x
    x.close() / "parent flush"\n"child flush"\n"close"
}

So that’s all it takes to imitate subclassing using go’s interfaces. It’s a nice tool to have in your pocket for those times when you can’t figure out how else to model your problem.

libgd bindings for go

tl;dr - read the documentation for gogd

In order to help a company modernize their creaking php based image resizer I started a set of cgo based libgd bindings for golang. With gogd, image resizing proxies can be built that leverage go’s awesome networking stack while delegating heavy processing to a battle tested library. An example image resizing web server is included in gogd. The rest of this post will be a few observations on making cgo bindings and some of the challenges.

But first, here’s the end result, using the Go gopher logo (derived from Renee French’s work):

[image: "hello from libgd"]

Challenges

I/O integration

One of the first challenges is getting data across the go/c boundary. Libgd has 3 different I/O methods for loading and saving images. The first is using a libc FILE* using functions like the one declared below:

gdImagePtr gdImageCreateFromPng(FILE *fd);

In order to access this API from golang it’s necessary to turn an os.File into a libc FILE*. This is trivial using fdopen(3) and os.File.Fd(). The technique is illustrated using some (untested) pseudocode.

import "C"
func ImageCreateFromPng(f *os.File) C.gdImagePtr {
     return C.gdImageCreateFromPng(C.fdopen(C.int(f.Fd()), C.CString("r")))
}

However this approach leaves something to be desired. Golang’s powerful io.Reader and io.Writer interfaces can’t be used. It’s impossible using this approach to load an image from an https url or write it as part of a response.

The next libgd I/O method is by passing around raw buffers. Libgd functions that work with buffers are suffixed with Ptr like this:

gdImagePtr gdImageCreateFromPngPtr(int size, void *data);

It’s possible to map the above arguments to a golang []byte data structure using C.GoBytes but buffers don’t let us use go’s flexible io interfaces.
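For completeness, a hedged sketch of a wrapper around the Ptr variant (the function name is hypothetical; C.CBytes copies the Go slice into C memory, and C.free needs stdlib.h in the cgo preamble):

func ImageCreateFromPngBytes(data []byte) C.gdImagePtr {
	p := C.CBytes(data) // copy the Go bytes into C-owned memory
	defer C.free(p)
	return C.gdImageCreateFromPngPtr(C.int(len(data)), p)
}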

Finally libgd has a generalized I/O “interface” called gdIOCtx, which in C land means structs of function pointers. Much like a go interface, callbacks for Read and Write are defined, but the “receiver” is an explicit rather than implicit first parameter. The structure looks a bit like:

typedef struct gdIOCtx {
...
int (*getBuf)(struct gdIOCtx *, void *, int);
void *data;
}

This is matched up with some go code which extracts the io.Reader and converts between []byte and c arrays.

func getContext(g gdio) *IOCtx {
	return (*IOCtx)(unsafe.Pointer(&C.gdIOCtx{
		getBuf: (*[0]byte)(C.gogd_get_buf),
		data:   unsafe.Pointer(&g),
	}))
}
//export gogd_get_buf
func gogd_get_buf(ctx *C.gdIOCtx, cbuf unsafe.Pointer, l C.int) int {
	gdio := (*(*gdio)(ctx.data))
	buf := goSliceFromcString((*C.char)(cbuf), int(l))
	n, err := gdio.(io.Reader).Read(buf)
	if err != nil && err != io.EOF {
		log.Println(err)
		return 0
	}
	return n
}

Through a little bit of trickery (look at io.go and io_c.go for the details) it’s possible to create structs of C function pointers which are really exported go functions which actually cast gdIOCtx.data to a golang io.Reader/io.Writer. While there is more overhead to this approach (more cgo calls) it’s the most general and flexible approach and it’s what gogd implemented first. We will probably add FILE and buffer based interfaces for better performance later. Using gdIOCtx means that gogd integrates trivially with files, net connections, and http response writers since they all satisfy the io.Reader/io.Writer interfaces.

Memory management

When integrating a garbage collected language with C, memory management can be a problem. Fortunately it’s simple (not easy, but simple) with go. Go’s garbage collector is only for memory, not really for other resources. Rather than relying on finalizers and dealing with cycles, memory management is usually as simple as:


img := decoder.Decode(resp.Body)
if !img.Valid() {
   panic("invalid image")
}
defer img.Destroy()

I’ve used runtime.SetFinalizer in the past and I know it can be indispensable when working with C code, but using defer is preferred. Finalizers aren’t always guaranteed to run, but a defer statement will always run unless you use os.Exit/log.Fatal.

Conclusions

If your workflow relies on libgd for performance, image format support, or certain algorithms, this library might be for you. However it won’t run on app engine, and in the long term go libraries (and go) will be optimized enough that there’s no reason to rely on cgo based libraries. In the mean time, gogd can help ease the transition of some of your image processing infrastructure from something like php to golang.

Now you have all the tools at your disposal to make some wicked cool cat memes.

[image: cat meme]

Creating an Ecommerce solution with Go

Background

Google’s Go programming language is rapidly making inroads in business critical applications. Recent trends in web application architecture have made Go an ideal server side language for implementing backend APIs for Single Page Apps. We discuss the adoption of Go by a typical LAMP shop, along with the pros and cons of adding Go to the technology stack.

Some history

Steals.com, like many Daily Deals Ecommerce sites, started with a few static pages and a checkout form. Dynamic content was added as the company grew, and as is so often the case a technology choice was made based on the lead developer’s experience. In this case, .net was considered but rejected due to licensing issues. Instead the stack began as Apache and PHP. Five years down the road, Steals.com had 70 employees, customer service, a fulfillment center, and an urgent need to launch a site with a new full catalog business model to engage customers with a pinterest-like product presentation, which meant they had extremely specific search and filtering business requirements that made existing search engines less attractive.

Choosing Go

The first choice for building the new retail site was PHP, which would allow easy reuse of all business logic and validation that had been created over the years. The technology team could also leverage their existing language knowledge for development, deployment, and operations.

However the existing PHP code and MySQL schema was not a natural fit for the search api, resulting in excessive page load times and poor user experience. Specifically the low cache hit rate inherent with search was a poor fit for a naive “add the filters to the query and cache the results” strategy. PHP’s inability to easily maintain an in-memory index was the nail in the coffin.

Due to past experience working with the search architects behind Backcountry.com, Best Buy, and Walmart, the Steals.com technology team knew that search is a great opportunity to build a section of an ecommerce site using a separate stack.

Go is a language that is easy for LAMP developers to pick up. Although it was designed to replace C++ for servers, it has seen more adoption in the Python community. Go doesn’t require an IDE such as Eclipse or NetBeans to be productive, so it’s a natural fit for developers used to working primarily with text editors. Built-in testing, coverage, and profiling tools are first class, as might be expected from a language designed to run in some of the world’s largest server clusters.

Go made sense for Steals because its high performance means a backend API (whether queried from a server or from a browser in a single page app) adds very little overhead to the request pipeline. Go also has first-class HTTP support, making it easy to build highly scalable web servers. The relevant PHP business logic could still be exercised by building a feed that the Go search server indexed.
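To give a flavor of how little code such a service needs, here’s a minimal sketch of a JSON search endpoint using only the standard library. The Product type, route, and handler are illustrative, not Steals.com’s actual API:

package main

import (
    "encoding/json"
    "log"
    "net/http"
)

// Product is a stand-in for the indexed catalog record.
type Product struct {
    ID    string  `json:"id"`
    Title string  `json:"title"`
    Price float64 `json:"price"`
}

func searchHandler(w http.ResponseWriter, r *http.Request) {
    q := r.URL.Query().Get("q")
    // The real service would consult its in-memory index here; we return
    // a canned result just to show the shape of the API.
    results := []Product{{ID: "sku-1", Title: "matches " + q, Price: 9.99}}
    w.Header().Set("Content-Type", "application/json")
    json.NewEncoder(w).Encode(results)
}

func main() {
    http.HandleFunc("/search", searchHandler)
    log.Fatal(http.ListenAndServe(":8080", nil))
}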

In the end the decision was made to go with Go.

Challenges

The existing product models were underspecified and the data was poor, making it hard to export to another system in a digestible manner. This challenge was mostly a function of the new requirements; building on another platform forced the creation of interfaces for communicating catalog data.

Team buy-in. Team members required some training, such as moving from PHP templating to Mustache templating. Getting buy-in from key influencers on the tech team is critical to successfully adopting a new language.

Implementation

In a short period of time a new search server was up and running. The Go server functioned only as a JSON API. A simple single page app was built that consumed the JSON API (as an aside, Go, PHP, and the browser were all able to use Mustache templates to share view logic). Response times from the new service were an order of magnitude faster than the PHP service, without having to spend days or weeks optimizing.

The shift in architecture (from MVC server side app to single page app) reduced the usefulness of a scripting language on the server. Now frontend developers just wrote javascript that interacted with JSON services written by backend developers.

Results

From a business perspective, the new Pinterest-like site and the choice of Go paid off. The new site was much more effective at highlighting aged inventory, and much more mobile-friendly.

From a technology perspective, Go had a few unexpected bonuses. The new service was much more amenable to unit testing than the old service. Supporting other technologies for enhancing the user experience (such as SPDY) became much simpler. And using more cross-platform tools (such as a JSON representation for product data and Mustache for views) makes building future services simpler.

c and go without cgo

[edit] This doesn’t apply to the current version of Go. The Plan 9 C compiler is on its way out; assembly and cgo will still be supported.

Most Gophers are familiar with cgo, a foreign function interface for Go. Using CGO, you can access your operating system’s shared objects and utilize libraries that may not exist (or may not be mature) in pure Go, such as certain database drivers or openssl bindings.
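For reference, calling into C through cgo looks something like this (a trivial, illustrative example):

package main

/*
static int add_one(int x) { return x + 1; }
*/
import "C"

import "fmt"

func main() {
    // add_one is compiled by the host C compiler and called via cgo.
    fmt.Println(C.add_one(41)) // prints 42
}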

CGO: the agony and the ecstasy

CGO is great; however, it comes with drawbacks. For one thing, you can’t easily cross compile when using CGO (you need a C cross toolchain), which breaks the workflow of developers who write Go on OS X and deploy to Linux servers. Go’s seamless cross compiling is a boon that’s painful to give up. Apps that use CGO can also be slow to build, due to their reliance on the host OS’s linker and compilers, which may not be optimized for build speed the way the Go toolchain is. Finally, CGO apps can run slower because:

  • CGO must communicate with the Go scheduler, and potentially create extra threads.
  • CGO calls run on a different stack because the stack requirements of CGO calls can’t be predicted.
  • A C shim must be generated to map Go semantics to C semantics with respect to multiple return values.
  • Calling conventions may differ between Go and C.

Using languages other than Go with go build

A little known fact is that Go ships a C compiler and assembler for all supported platforms. Several files in the Go runtime are compiled using this C compiler; it appears to be a continuation of the Plan 9 C compiler toolchain. The linkers, assemblers, and compilers are fully integrated into the go build tool. Go’s stdlib actually uses assembly to implement things like hardware accelerated AES on CPUs that support the AES-NI instruction set. The ability to drop assembly and C files into an arbitrary Go package is powerful. It allows library and app developers to be “on the same level” as the Go language designers. If you don’t like how some data structure works in Go, you can write your own that might be just as fast or faster, and it will work with go get. How about some examples!

Hello from C

Let’s start with a simple Go program. We will have to declare the signature of our C function, which will be linked in later. Place this file at $GOPATH/src/helloc/gohello.go:

package main

import "fmt"

// forward declare AddOne
func AddOne(*int64)

func main() {
    i := int64(0)
    AddOne(&i)
    fmt.Println(i) // prints 1
}

We will also (of course) need a C file. Place this file, chello.c, in the same directory:

void ·AddOne(long long int *t) { // use of · is required
    *t += 1;
}

We can build the executable by running:

CGO_ENABLED=0 go build helloc
./helloc # should print out 1

There you have it: a Go program using C without CGO. You can try building for another OS to prove it’s not CGO.

CGO_ENABLED=0 GOOS=windows go build helloc
./helloc.exe # kind of works, but crashes if wine is installed

Why write C?

I hear you saying “That’s great, but why would I ever need it?”, and the answer is you probably don’t. Most C code in the wild probably works best™ with gcc or clang, so I don’t see many people compiling existing C codebases this way. I see 3 areas where C+Go-cgo would come in useful:

  • Shared logic. Go 1.4 will probably support Android, making iOS the only platform I care about that’s missing Go support. If I’m writing a cross-platform app, I can write a subset of shared logic in C.
  • Performance. For applications that really don’t want the overhead of GC, and/or want to do crazy type-system things, C+Go-cgo would be a good solution. Some algorithms might be more naturally expressed in C or easier to port from gcc-flavored C to Plan 9 C.
  • Debugging/extensions not supported in Go. (WARNING: it’s a horrible idea to rely on internals; nothing in the runtime is covered by the Go 1 compatibility promise. Here there be dragons.) Here’s an example of implementing runtime.GOMAXPROCS in C.

    extern runtime·gomaxprocs;
    void ·MyGoMaxProcs(long int *t) {
        *t = runtime·gomaxprocs;
    }
    

And last but not least, you might just be interested in hacking on go internals.

Summary

In conclusion: you will probably never have to put a .c or .s file in your Go package, until you do. Then you’ll be thanking your lucky stars you’re not writing Python C extensions or Java JNI. There’s literally no reason your code can’t be as fast as physically possible on your CPU architecture. I think all gophers should spend a little time learning how the Go runtime handles scheduling, and that’s just the tip of the iceberg. Go’s entire API and toolchain is a great history lesson, and it’s all refreshingly easy to comprehend. The Go runtime is a beautiful piece of engineering encompassing the shared wisdom of decades of OS and API design experience. Thanks to all the Go authors.

about whitane tech


Whitane Technologies is a full-stack consulting service for ecommerce development, specializing in productive and performant technologies like Go(lang), Docker, and Cassandra, as well as traditional stacks such as Java, LAMP, and SQL databases.

You can email me at shanemhansen@whitane.com.

I’m always interested in hearing about performance and scaling.