
Making Code the Universal Language

In 2013, I landed in Shanghai for a few meetings. My first few minutes walking around the city led to a conversation with a stranger in a bar, winding down after a long day of work. We discussed life in Shanghai: where he lived, how long he’d been there, and what he did for a living. He told me he had just landed a new job with a programming consultancy in Shanghai, and said it all started with a website, Codecademy.com, through which he’d learned to code.

This scene repeated itself months later in Dublin, where I was meeting James Whelton of CoderDojo. We talked about programming education and noticed the couple next to us were talking about programming too. It turned out that they were both Dublin-based programmers – he for Facebook and she for a software consultancy. She talked about how she was planning to leave her job writing Apex (for Salesforce) to take a job writing Ruby, which she had just learned on Codecademy.

These stories aren’t unique; in fact, they’re a reality for most Codecademy users, 70% of whom live outside the United States. From the beginning, we’ve watched with amazement as Codecademy spread. The day we launched, we expected traffic to die down overnight in California, but we hadn’t taken into account that people were just signing on in other parts of the world. Since then, we’ve kept our global audience in mind with everything we’ve built, realizing that the power of an education transcends borders: it reaches far beyond the city where we build Codecademy and the language we speak.

Codecademy: Bringing Skills to You, Wherever You Are

Today, we’re bringing easy access to a world-class skills education to even more people across the world, hoping they’ll benefit from Codecademy in the same way that our more than 24 million existing learners have. We’ve worked to translate Codecademy into Spanish, Portuguese, and French, with more languages on the way. But that’s not all: we’re working closely with local partners to create communities and embed Codecademy in new countries, helping new learners all over the world gain the skills they need to succeed in the 21st century. We’ve got amazing partners helping us bring Codecademy to five new countries (along with every country that speaks their languages!).

United Kingdom

The UK made news as the first G8 country to mandate programming education for all primary and secondary students. We’ve worked hand-in-hand with many organizations in the UK over the past few years: sponsoring Code Club as they bring programming education to after-school groups, working with the Computing At School network to help connect teachers with resources, and working with the government itself to bring programming to classrooms. We’re now doubling down on our commitment to the UK by opening our first international office in London, headed by Rachel Swidenbank.

France

Libraries Without Borders (Bibliothèques sans Frontières) has worked tirelessly over the past few years to expand access to literacy across French-speaking countries, among them Haiti and Cameroon. Today, Codecademy is working with Libraries Without Borders to translate Codecademy into French and to implement pilot programs aimed at reducing unemployment and bringing programming into schools. In addition, Codecademy will be a component of the recently announced Ideas Box (designed by Philippe Starck), a project that will be deployed in refugee camps and disaster zones across the world to empower individuals with the skills to improve their lives. Grants from the public and private sectors in France helped make all of this possible.

Brazil

The Lemann Foundation is the largest education foundation in Brazil, funding innovation at the K-12 level and beyond, both by fostering projects inside the country and by bringing international technological developments to students. Codecademy is available in Portuguese today thanks to close work with the Lemann Foundation, and it will soon launch in several pilot programs across Brazil. One of our proudest moments was talking with Brazilian teachers a month ago in São Paulo about today’s launch of Codecademy in Portuguese, their native language.

Argentina and Buenos Aires

The Government of Buenos Aires, led by Mauricio Macri, has made an ambitious commitment to bringing skills and programming education to all of its citizens by working together with Codecademy. Jorge Aguado, the head of educational technology for the City, has worked to make Buenos Aires one of the first cities in South America (and the world!) to make a statement about its digital future: tying programming into every school in Buenos Aires, pursuing a campaign to provide skills to the unemployed, and training government workers in technology. Both we and the government of Buenos Aires believe this is the first commitment of its kind in South America, and a terrific template for other cities (and governments) moving forward. Buenos Aires’ commitment is particularly notable given that its Spanish translations will be available to the entire Spanish-speaking world.

Estonia

Estonia’s Tiger Leap program has helped it become one of the most advanced digital economies in the world. We hope to support this commitment by working with the Estonian government to help every Estonian K-12 student learn to program.

Codecademy Speaks Spanish, Portuguese, French, and more!

It’s often said that code is the “language of the 21st century.” We at Codecademy think that code is a language that’s cross-border and truly international, and that our new work internationally is an essential step towards bringing advanced digital skills to people all over the world. We can’t wait to hear stories from the millions of new Codecademy learners to come and from the additional partners we’ll be announcing soon!


Building education for the world isn’t easy — technically or from a product perspective. Want to work on projects like this? We’re hiring!

EventHub: An Open-Sourced Funnel Analysis, Cohort Analysis, and A/B Testing Tool

As product development becomes more and more data driven, the demand for essential data analysis tools has surged dramatically. Today, we’re excited to announce that we’ve open-sourced EventHub, an event analysis platform that enables startups to run funnel and cohort analysis on their own servers. Deploying EventHub only requires downloading and executing a jar file. To give you a taste of what EventHub can do, we’ve set up a demo server, preloaded with example funnel and cohort analysis queries, for you to play with.
EventHub was designed to handle hundreds of events per second while running on a single commodity machine, so you don’t need to worry about pricey bills. We did this to make it as frictionless as possible for anyone to start doing essential data analysis. While more details can be found in our repository, the following are some key observations, assumptions, and rationales behind the design.

[Screenshots: funnel analysis and cohort analysis views]

Architecture Overview

Basic funnel queries require only two indices: a sorted map from (event_type, event_time) pairs to events, and a sorted map from (user, event_time) pairs to events.

Basic cohort queries require only one index: a sorted map from (event_type, event_time) pairs to events.

A/B testing and power analysis are simply statistical calculations based on funnel conversion rates and pre-determined thresholds (see the sketch below).
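To make that last point concrete, here is a minimal sketch, in Java, of the kind of statistics involved: a two-proportion z-test comparing funnel conversion between an A group and a B group against a pre-determined threshold. The class, names, and numbers are illustrative assumptions, not EventHub’s actual code.

```java
// Minimal sketch of a two-proportion z-test on funnel conversion rates.
// Illustrative only; names and thresholds are assumptions, not EventHub's code.
public final class AbTest {
    /** z-statistic comparing conversion rates of groups A and B. */
    static double zStatistic(long convertedA, long totalA, long convertedB, long totalB) {
        double pA = (double) convertedA / totalA;
        double pB = (double) convertedB / totalB;
        // Pooled conversion rate under the null hypothesis of no difference.
        double pooled = (double) (convertedA + convertedB) / (totalA + totalB);
        double stdErr = Math.sqrt(pooled * (1 - pooled) * (1.0 / totalA + 1.0 / totalB));
        return (pA - pB) / stdErr;
    }

    public static void main(String[] args) {
        double z = zStatistic(1200, 10000, 1320, 10000);
        // |z| > 1.96 corresponds to a 5% two-sided significance level.
        System.out.println("z = " + z + ", significant = " + (Math.abs(z) > 1.96));
    }
}
```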

Returning to the indices: as long as the two indices from the first point fit in memory, all basic analyses (queries by event_type and date range) can be done efficiently. Now, consider a hypothetical scenario in which there are one billion events and one million users. A pointer-based sorted map implementation like an AVL tree, red-black tree, or skip list can be dismissed, as the overhead of pointers would be prohibitively large. A B+tree may seem a reasonable choice, but since events are ordered and stored chronologically, sorted parallel arrays are a much more space-efficient and simpler implementation. That is, the first index, from (event_type, event_time) pairs to events, can be implemented as one array storing event_time for each event_type and a parallel array storing event_id, and similarly for the other index. Though separate indices are needed to look up from an event_type or user to its corresponding parallel arrays, the numbers of event_types and users are orders of magnitude smaller than the number of events, so the space overhead is negligible.
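To illustrate the parallel-array layout, here is a simplified sketch of the first index. All names are hypothetical, and a production version would use primitive long arrays (or memory-mapped files) rather than boxed lists; this is just to show the shape of the structure.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the (event_type, event_time) -> events index as sorted parallel
// arrays. Events arrive chronologically, so plain appends keep both arrays sorted.
final class EventTypeIndex {
    private static final class Shard {
        final ArrayList<Long> eventTimes = new ArrayList<>(); // sorted by construction
        final ArrayList<Long> eventIds = new ArrayList<>();   // parallel to eventTimes
    }

    private final Map<String, Shard> byEventType = new HashMap<>();

    /** Append an event; times and ids are monotonically increasing. */
    void append(String eventType, long eventTime, long eventId) {
        Shard s = byEventType.computeIfAbsent(eventType, t -> new Shard());
        s.eventTimes.add(eventTime);
        s.eventIds.add(eventId);
    }

    /** Event ids of the given type with event_time in [from, to), via binary search. */
    List<Long> query(String eventType, long from, long to) {
        Shard s = byEventType.get(eventType);
        if (s == null) return Collections.emptyList();
        return s.eventIds.subList(lowerBound(s.eventTimes, from), lowerBound(s.eventTimes, to));
    }

    // First index whose value is >= key.
    private static int lowerBound(ArrayList<Long> a, long key) {
        int lo = 0, hi = a.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (a.get(mid) < key) lo = mid + 1; else hi = mid;
        }
        return lo;
    }
}
```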

With parallel array indices, the amount of memory needed is approximately 1B events * (8 bytes for timestamp + 8 bytes for event id) * 2 = ~32GB, which still seems prohibitively large. However, one of the biggest advantages of using parallel arrays is that within each array the content is homogeneous, and thus compression friendly. The compression requirement here is very similar to compressing posting lists in search engines, and with an algorithm like PForDelta, the compressed indices can reasonably be expected to be under 10GB. In addition, EventHub makes another important assumption: a date is the finest query granularity. As event ids are assigned in monotonically increasing order, the event id itself can be thought of as a logical timestamp. EventHub maintains another sorted map from each date to the smallest event id on that date, so any query filtered by a date range can be translated into a query filtered by an event id range. With that assumption, EventHub can drop the time arrays entirely, halving the index size again (<5GB). Lastly, since the indices are event ids stored chronologically in arrays backed by memory-mapped files, they are very friendly to the kernel page cache. And assuming most analyses only care about recent events, as long as the tails of the indices fit in memory, most analyses can be computed without touching disk.
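Here is a minimal sketch of that date-granularity trick: maintain a sorted map from each date to the smallest event id seen that day, and translate date-range filters into event-id ranges. All names are hypothetical.

```java
import java.time.LocalDate;
import java.util.Map;
import java.util.TreeMap;

// Sketch: because event ids are monotonic, the smallest id per day is enough
// to turn a [startDate, endDate] filter into a contiguous event-id range.
final class DateToEventId {
    private final TreeMap<LocalDate, Long> firstIdOfDay = new TreeMap<>();

    /** Record an event; only the first (smallest) id of each day is kept. */
    void onEvent(LocalDate day, long eventId) {
        firstIdOfDay.putIfAbsent(day, eventId);
    }

    /** Smallest event id on or after startDate, or Long.MAX_VALUE if none. */
    long idRangeStart(LocalDate startDate) {
        Map.Entry<LocalDate, Long> e = firstIdOfDay.ceilingEntry(startDate);
        return e == null ? Long.MAX_VALUE : e.getValue();
    }

    /** Exclusive upper bound: the first event id after endDate. */
    long idRangeEnd(LocalDate endDate) {
        return idRangeStart(endDate.plusDays(1));
    }
}
```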

At this point, the basic indices are small enough that EventHub can answer basic funnel and cohort queries efficiently. However, there are no indices for the other properties on events, so for queries filtered by event properties other than event_type, EventHub still needs to look up the event properties from disk and filter events accordingly. The space and time complexity of this type of query is not easy to estimate analytically, but in practice, when we ran our internal analyses at Codecademy, the running time for most funnel or cohort queries with event property filters was around a few seconds. To optimize query performance, the following are some key features we implemented; more details can be found in the repository.

Each event has a Bloom filter that quickly rejects property values that don’t exactly match the filter

An LRU cache for events

Assuming the Bloom filters are in memory, EventHub only needs to do disk lookups for events that actually match the filter criteria (true positives), plus the false positives from the Bloom filters. As the size of the Bloom filters can be configured, the false positive rate can be tuned accordingly. Additionally, since most queries only involve recent events, EventHub also keeps an LRU cache of events to optimize query performance. Alternatively, EventHub could have implemented inverted indices, as search engines do, to support fast equality filters. The primary reason for adopting Bloom filters plus a cache is that it doesn’t require adding more posting lists as new event properties are added, and we believe that for most use cases, with proper compression, EventHub can easily cache hundreds of millions of events in memory and achieve low query latency.
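For illustration, here is a rough sketch of a per-event Bloom filter over "key=value" property strings, built with Guava’s BloomFilter. The encoding and class names are assumptions for the sketch rather than EventHub’s exact implementation.

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

// Sketch: one small Bloom filter per event over its "key=value" property
// strings. A query consults the filter first and only hits disk for events
// that might match. Illustrative only, not EventHub's exact code.
final class EventPropertyFilter {
    private final BloomFilter<String> filter;

    EventPropertyFilter(int expectedProperties, double falsePositiveRate) {
        this.filter = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8),
                expectedProperties, falsePositiveRate);
    }

    void addProperty(String key, String value) {
        filter.put(key + "=" + value);
    }

    /** False means "definitely no match": the event can be skipped without disk I/O. */
    boolean mightMatch(String key, String value) {
        return filter.mightContain(key + "=" + value);
    }
}
```

Because a Bloom filter can return false positives but never false negatives, a `false` answer safely skips the disk lookup; the configurable false positive rate only controls how many unnecessary lookups remain.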

Lastly, EventHub as it stands doesn’t compress the indices; we’ve left that as a TODO. In addition, the following two features could easily be added to achieve higher throughput and lower latency if needed.

Event properties could be stored column-oriented, which would allow high compression rates and good cache locality

Events from each user in funnel analysis, cohort analysis, and A/B testing are independent. As a result, horizontal scalability can be achieved trivially by sharding by user (as sketched below).
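As a minimal sketch of that last point (any deterministic hash of the user works; this is not EventHub’s actual routing code):

```java
// Sketch: route all of a user's events to the same shard, so per-user funnel,
// cohort, and A/B computations stay local and shard results can simply be summed.
final class UserSharding {
    /** Deterministically route a user to one of numShards shards. */
    static int shardFor(String userId, int numShards) {
        return Math.floorMod(userId.hashCode(), numShards);
    }
}
```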

As always, it’s open sourced, and pull requests are highly welcome.

If you like this post, you can follow me (@chengtao_chu) on Twitter or subscribe to my blog, “ML in the Valley.” Special thanks to Bob Ren (@bobrenjc93) for reading a draft of this.

WordPress 3.9.1 Maintenance Release

After three weeks and more than 9 million downloads of WordPress 3.9, we’re pleased to announce that WordPress 3.9.1 is now available.

This maintenance release fixes 34 bugs in 3.9, including numerous fixes for multisite networks, customizing widgets while previewing themes, and the updated visual editor. We’ve also made some improvements to the new audio/video playlists feature and made some adjustments to improve performance. For a full list of changes, consult the list of tickets and the changelog.

If you are one of the millions already running WordPress 3.9, we’ve started rolling out automatic background updates for 3.9.1. For sites that support them, of course.

Download WordPress 3.9.1 or venture over to Dashboard → Updates and simply click “Update Now.”

Thanks to all of these fine individuals for contributing to 3.9.1: Aaron Jorbin, Andrew Nacin, Andrew Ozz, Brian Richards, Chris Blower, Corey McKrill, Daniel Bachhuber, Dominik Schilling, feedmeastraycat, Gregory Cornelius, Helen Hou-Sandi, imath, Janneke Van Dorpe, Jeremy Felt, John Blackbourn, Konstantin Obenland, Lance Willett, m_i_n, Marius Jensen, Mark Jaquith, Milan Dinić, Nick Halsey, pavelevap, Scott Taylor, Sergey Biryukov, and Weston Ruter.