Technology behind last week’s 1.5 Million trees

If you’ve followed our blog recently, you know that we raised over 1.5 Million Euros/Trees in less than a week during the “Waldrekord Woche” campaign by the German TV station SAT1.

After we crossed 1 Million Trees on Thursday, March 18, an additional half a million trees were donated on Friday alone. (The current counter still reads 1.48x Million; this calculation also includes pending donations through SEPA/SOFORT, which take up to 5 days to finalize.)

This was an incredible campaign for us, and we are very grateful for the support.

As we had some technology hiccups during the campaign, I will share here what powered it, what went right and what went wrong, some of our plans to make this technology open, and how we plan to handle even bigger campaigns.

#1 The WebApp for Running Campaigns

The site waldrekord.plant-for-the-planet.org/home is a customized version of the Plant-for-the-Planet WebApp embedded in an iframe. It includes all the technology available on our site and is no different from salesforce.com/trees or stern.de/baeume. You can read more about our web app here.

To power more campaign sites like these, we are working on a payment gateway (Donate with Plant-for-the-Planet) to do the heavy lifting of payments, accounting, and tax deductions for users and Tree Planting Organizations. This will be available free of cost for Tree Planting Organizations in Q2 2021.

#2 Donations

Donations in the SAT1 Campaign were powered by Stripe, PayPal, and our own TreeCash Gateway (in Beta). Users could donate with Credit Cards, PayPal, SEPA, GiroPay, SOFORT, and SMS. A week before the campaign, we enabled SOFORT and GiroPay for all projects using our default Payment Setup.

We also implemented automated Tax Receipts in the US and Germany for donations processed by Plant-for-the-Planet. Starting in April, we will send tax receipts to eligible donors right after the donation is completed. To aid tax agencies and donors, we’ve also created a site to check the validity of tax receipts at any time here.

At the end of February, we offered the TreeCash Gateway to a few companies that want to integrate microtransactions with Trees. TreeCash can be integrated into shops, e-commerce platforms, or any checkout flow, and it has already processed over 15k transactions with a 100% success rate. If you are interested in TreeCash, please contact [email protected]
We will share more info on TreeCash in the coming days.

As of March 22nd, over 100k donations were made in the SAT1 Waldrekord Campaign.

#3 The site crashed …

To be precise, at times, our apps got very slow and became inaccessible for most of our users. We were on television during peak hours. 😊

The Plant-for-the-Planet App usually handles around 500-4,000 concurrent users with a 100 ms response time. According to Cloudflare Radar, it is among the top 100k most frequently visited domains in the world.

Most of our user-facing APIs are cached for a few minutes (e.g., Projects, PaymentSetup). Others serve dynamic content for logged-in users that does not require much computation (e.g., loading a user’s profile or competition).

However, two components of the Campaign (the Leaderboard and the Counter) were dynamic. We use Redis to cache both the leaderboard and the tree counter, but the cache was cleared after every successful donation. We managed to scale our PHP Application horizontally to use over 20x our available resources, but with over 20 concurrent donations per second, the Redis cache invalidation made our cache useless.
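To make the effect of the invalidation strategy concrete, here is a minimal sketch that counts how often a cached value has to be recomputed under each approach. It uses an in-memory stand-in rather than Redis, and everything in it is an assumption for illustration: the function names, the evenly spaced timeline, and the read rate of 100 requests per second. Only the 20 donations per second and the 14-second lifetime (mentioned later in this post) come from the campaign. It also ignores concurrent requests arriving while a value is being recomputed, so real load would be worse.

```typescript
// Hypothetical in-memory model of one cached value (e.g. the tree counter).
type Strategy = "invalidate-on-donation" | "fixed-ttl";

interface SimulationResult {
  reads: number;
  recomputations: number; // reads that had to go to the database
}

// Returns evenly spaced timestamps in milliseconds: offsetMs, offsetMs + stepMs, ...
function every(stepMs: number, untilMs: number, offsetMs = 0): number[] {
  const times: number[] = [];
  for (let t = offsetMs; t < untilMs; t += stepMs) times.push(t);
  return times;
}

// Replays sorted read and donation timestamps and counts cache misses.
function simulate(
  strategy: Strategy,
  readTimesMs: number[],
  donationTimesMs: number[],
  ttlMs: number
): SimulationResult {
  let cachedAtMs: number | null = null; // null means the cache is empty
  let recomputations = 0;
  let nextDonation = 0;

  for (const now of readTimesMs) {
    if (strategy === "invalidate-on-donation") {
      // Every donation since the previous read clears the cache.
      while (nextDonation < donationTimesMs.length && donationTimesMs[nextDonation] <= now) {
        cachedAtMs = null;
        nextDonation++;
      }
    } else if (cachedAtMs !== null && now - cachedAtMs >= ttlMs) {
      // Fixed TTL: donations are ignored, only age matters.
      cachedAtMs = null;
    }
    if (cachedAtMs === null) {
      recomputations++;
      cachedAtMs = now;
    }
  }
  return { reads: readTimesMs.length, recomputations };
}

// Ten seconds of traffic: a read every 10 ms, a donation every 50 ms (20 per second).
const reads = every(10, 10_000);
const donations = every(50, 10_000, 25);
const onDonation = simulate("invalidate-on-donation", reads, donations, 14_000);
const fixedTtl = simulate("fixed-ttl", reads, donations, 14_000);
console.log(onDonation.recomputations, fixedTtl.recomputations); // 201 1
```

In this toy timeline, clearing the cache on every donation sends roughly one in five reads to the database, while a fixed lifetime recomputes the value once for the whole window.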

As a result, the MySQL database had to compute both the recent leaderboard and the total count from the entire donation table for the full traffic of 20k+ concurrent users. This pushed the database CPU to 100%, compared with a baseline of around 5%. In the process, it threw 5xx errors and dropped over 90% of the requests.

Slow down on March 16 vs change in Redis Cache Invalidation
Newrelic showing transactions during the slowdown on March 16

Because the API for the leaderboard and total count threw 5xx errors, the Next.js application handled those errors with a redirect, which ultimately returned a 404.
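One way to avoid this failure mode is to turn a failed API response into a state the page can render in place, instead of redirecting. The sketch below is a hypothetical illustration, not our actual Next.js code: the `ApiResponse` and `WidgetState` types, the `toWidgetState` function, and the 15-second retry hint are all assumptions.

```typescript
// Hypothetical types: the real API payloads and page components are not shown in this post.
interface ApiResponse<T> {
  status: number; // HTTP status code returned by the API
  body: T | null; // parsed JSON body, or null when there is none
}

type WidgetState<T> =
  | { kind: "ready"; data: T }
  | { kind: "unavailable"; retryInSeconds: number };

// Maps an API response to a render state. Any non-2xx status (or an empty body)
// becomes an "unavailable" state, so a failing endpoint never triggers a redirect.
function toWidgetState<T>(response: ApiResponse<T>, retryInSeconds = 15): WidgetState<T> {
  const isSuccess = response.status >= 200 && response.status < 300;
  if (!isSuccess || response.body === null) {
    return { kind: "unavailable", retryInSeconds };
  }
  return { kind: "ready", data: response.body };
}

// A 503 from the counter endpoint degrades only the counter; the page still renders.
const counterState = toWidgetState<{ totalTrees: number }>({ status: 503, body: null });
const leaderboardState = toWidgetState({ status: 200, body: { totalTrees: 1000000 } });
console.log(counterState.kind, leaderboardState.kind);
```

With this shape, the page can show a placeholder for the affected widget and retry later, while the rest of the site stays reachable.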

To mitigate these issues, we implemented a few steps during the event to prevent another crash.

  1. We added error handling to avoid redirections from API errors. This now prevents 404 errors when the API fails.
  2. We also changed the Redis cache invalidation strategy. We now cache results for a fixed number of seconds regardless of donations received; after some trial and error, the cache currently lives for 14 seconds. This helped initially, but once the cache expired, the requests that arrived while it was being regenerated were more than enough to push the database above 70% CPU. So we opted to add a Cloudflare cache on top.
  3. We used Cloudflare Workers to cache on the Edge.
    We love Cloudflare at Plant-for-the-Planet. We’ve used Cloudflare tools in almost all our applications and have several use cases for Workers. We used Workers to cache the results on the edge for 15 seconds using Cache-Control: s-maxage=15, and added stale-while-revalidate=60. This allows Cloudflare to serve stale results for up to 1 minute while retrieving fresh results from the origin server every 15 seconds. We arrived at these numbers through trial and error as well.

    There is still some cache leakage when many requests come from Cloudflare edge locations with little traffic, but the cache should absorb the load fairly quickly once the traffic is stable.
Leaderboard Endpoint Cached by Cloudflare Workers
Corresponding CPU time where leaderboard and Tree count is cached
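The edge caching values above can be read as three time windows. The sketch below models the general semantics of the s-maxage and stale-while-revalidate directives, using the 15 and 60 second values from this post. It is not Cloudflare’s implementation or our Worker code, and the function and type names are hypothetical.

```typescript
// Values from the post: 15 s fresh, then up to 60 s of serving stale while revalidating.
const S_MAXAGE_SECONDS = 15;
const STALE_WHILE_REVALIDATE_SECONDS = 60;

// Builds the Cache-Control header value that the origin (or a Worker) would attach.
function buildCacheControl(sMaxAge: number, staleWhileRevalidate: number): string {
  return `s-maxage=${sMaxAge}, stale-while-revalidate=${staleWhileRevalidate}`;
}

type CacheDecision =
  | "serve-fresh" // within s-maxage: answer from cache, no origin request
  | "serve-stale-and-revalidate" // stale but inside the grace window: answer from cache, refresh in background
  | "fetch-from-origin"; // too old: the request has to wait for the origin

// Decides what a shared cache may do with a stored response of the given age.
function decide(ageSeconds: number, sMaxAge: number, staleWhileRevalidate: number): CacheDecision {
  if (ageSeconds < sMaxAge) return "serve-fresh";
  if (ageSeconds < sMaxAge + staleWhileRevalidate) return "serve-stale-and-revalidate";
  return "fetch-from-origin";
}

console.log(buildCacheControl(S_MAXAGE_SECONDS, STALE_WHILE_REVALIDATE_SECONDS));
for (const age of [5, 20, 80]) {
  console.log(age, decide(age, S_MAXAGE_SECONDS, STALE_WHILE_REVALIDATE_SECONDS));
}
```

The middle window is what protects the database: while a refresh is in flight, visitors keep getting the previous result instead of all hitting the origin at once.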

During these downtimes, we found two donors who were charged multiple times. We caught these charges early and have already resolved the issues with the donors.

Takeaways:

In v1.0 of the Plant-for-the-Planet apps, the Leaderboard and Tree Count were served by Elasticsearch. This allowed results to be served fast, with minimal load on the server/database. As we worked on v2.0 of the WebApp, we postponed our plans to improve the Leaderboard 2.0 to Q3 of 2021 and chose an easy-to-use solution for the time being. This was one of the main causes of these issues.

We are continuously making improvements to our products, and as we plan to increase our offerings to other organizations, it is more important than ever for us to maintain stability.

We would not have been able to scale like this without our incredible supporters like Heroku and Cloudflare, who power the Plant-for-the-Planet Applications for the Trillion Tree Campaign.

Whether you are a student or a professional, if you have experience with the Symfony framework or React and are looking for new challenges, find us on GitHub and say hello. We are often looking for curious minds to join our team.