GitHub (Backend Engineer)

Why do you feel you would be a fit for this role?
Putting technical skills aside, it seems to me that success in this role depends on being able to communicate well in a remote environment. I had been working remotely for a while even before the pandemic, and I really enjoy the remote working style. Although communication can get difficult, I have no hesitation at all about initiating conversations with colleagues remotely.
What do you think are some of the challenges of working remotely, and how would you address them?
Slower communication. For instance, review comments arrive more slowly, especially when you have time zone differences with teammates. One way to address this is pair programming. An approach that has worked well for me is "ping-pong pair programming" with teammates in different time zones. How cool is it that by the time you wake up the next morning, the code you were working on is already running in production? This kind of practice also encourages collective ownership!
Tell us about a cool program or project you’ve used or seen that takes both infrastructure design and software engineering concepts into consideration. What is it and why did you think it was really interesting/cool?
In one of our legacy systems, we had a KeyValue DB (DynamoDB) -> MessageBroker (DynamoDB Stream) -> KeyValue DB (DynamoDB) pipeline that let us query how many times a user had viewed a certain advertisement within a certain time window. I was able to replace this architecture with a Redis cluster by using a sorted set (a data structure Redis supports natively), which lets us query the metric for a given user within a given time range fast enough for our needs. Not only did our server costs drop and our architecture become simpler, we also got faster response times because we switched from an HTTP connection (DynamoDB's protocol) to a TCP connection (Redis's protocol).
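The sorted-set idea can be sketched as follows. This is a minimal, hypothetical stand-in written in plain Python: the `ViewCounter` class and its names are my own illustration, with `bisect` on a sorted list playing the role of Redis's ZADD (insert a member scored by timestamp) and ZCOUNT (count members within a score range).

```python
import bisect
from collections import defaultdict

class ViewCounter:
    """Mimics one Redis sorted set per (user, ad) pair:
    views are scored by timestamp, so a range count answers
    'how many times did this user view this ad in this window?'"""

    def __init__(self):
        # key -> sorted list of view timestamps
        self._views = defaultdict(list)

    def record_view(self, user, ad, ts):
        # Redis equivalent: ZADD views:{user}:{ad} <ts> <view-id>
        bisect.insort(self._views[(user, ad)], ts)

    def count_views(self, user, ad, start, end):
        # Redis equivalent: ZCOUNT views:{user}:{ad} <start> <end>
        times = self._views[(user, ad)]
        return bisect.bisect_right(times, end) - bisect.bisect_left(times, start)
```

Because the set is kept ordered by score, the windowed count is a pair of binary searches rather than a table scan, which is what made the Redis version fast enough to replace the DynamoDB pipeline.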
Tell us about an application that you helped to develop and maintain that made it to production. What were some scale and system-related considerations you had to work through?
Our platform is integrated with a few external partners to serve ads. One of those partners is a brand-safety measurement partner, which makes sure our ads are served to appropriate publishers (for example, do not serve car ads on car-accident news pages). Because the data our partner provides for each publisher does not change often, instead of calling the partner API on every ad request, we cache the response after the first request so we can scale to 80k ad requests per second. However, to avoid serving stale data (for example, a page updated with car-accident content), our partner asked us to refresh our cache every 8 hours. We use DynamoDB to cache the responses and set the time to live (TTL) to 8 hours so that old data gets deleted. DynamoDB will eventually delete items whose TTL has expired, but it makes no promise about how quickly. To make sure we never use expired data, we validate it actively: when our client reads an item from DynamoDB, it checks whether the item has expired, and if it has (but DynamoDB has not yet deleted it), the client deletes it itself.
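The read-side validation can be sketched like this. A plain dict stands in for the DynamoDB table, and the class and field names (`TtlCache`, `expires_at`) are my own illustration of the pattern, not our production code:

```python
import time

CACHE_TTL_SECONDS = 8 * 60 * 60  # the partner-mandated 8-hour refresh window

class TtlCache:
    """Stand-in for a DynamoDB-backed cache: every item carries an
    expiry timestamp, and reads validate it because DynamoDB only
    deletes expired items eventually, not immediately."""

    def __init__(self, clock=time.time):
        self._items = {}   # stand-in for the DynamoDB table
        self._clock = clock

    def put(self, key, value):
        self._items[key] = {
            "value": value,
            "expires_at": self._clock() + CACHE_TTL_SECONDS,
        }

    def get(self, key):
        item = self._items.get(key)
        if item is None:
            return None  # cache miss: caller falls back to the partner API
        if item["expires_at"] <= self._clock():
            # Expired but not yet reaped by DynamoDB's TTL process:
            # delete it ourselves and treat it as a miss.
            del self._items[key]
            return None
        return item["value"]
```

Injecting the clock makes the expiry logic easy to test without waiting 8 hours, which is the same reason the production client keeps the validation check in one place.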
Pick a feature or screen of a developer tool you use and tell us how you'd make it better. What problems do you see with the current approach? What challenges do you anticipate in implementing the improvement? For example, you could choose your favorite editor, git client, or GitHub itself.
I actually sent a pull request to the open-source project hoping the maintainers would take a look: https://github.com/awslabs/amazon-kinesis-agent . We use AWS Kinesis Data Streams extensively and rely on amazon-kinesis-agent a lot. With the AWS SDK you can choose your partition key (which shard a record goes to), but not with the agent: it generates a random string, so our data ends up in random shards of our streams. In one of our performance analyses, we realized that if our records were routed to the right shard, we could eliminate a lot of duplication and increase our service's efficiency. Unlike the SDK, where you can compute the partition key in your favourite programming language, the agent doesn't have that kind of flexibility. All we have is a config file, so I figured a regex would be a good way to express this. Let me know what you think =)
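The idea behind the pull request can be sketched like this: a configured regex is applied to each record, and its first capture group becomes the partition key, so records with the same key always land on the same shard. The pattern, field name, and fallback below are hypothetical examples, not the agent's actual configuration:

```python
import re

# Hypothetical config value: pull a user ID out of each JSON log record
# and use it as the Kinesis partition key, so one user's records always
# hash to the same shard.
PARTITION_KEY_PATTERN = re.compile(r'"user_id":\s*"([^"]+)"')

def partition_key_for(record, fallback="no-match"):
    """Return the regex's first capture group as the partition key.
    When the record doesn't match, fall back to a fixed key here
    (the real agent generates a random string instead)."""
    match = PARTITION_KEY_PATTERN.search(record)
    return match.group(1) if match else fallback
```

Routing by a stable key like this is what removed the duplication in our analysis: with random keys, related records scatter across shards and each consumer has to de-duplicate independently.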
Describe a time when you advocated for a change that was important to the community or your team. What was it? Why was it so important? How did you approach the subject with your team? Was it successful? Why or why not?
We use Scala for several of our services across different teams, but we never had a standard way to profile an application. The first time I tried to profile one of our applications to solve a performance problem, I realized it was really hard: the JVM ecosystem is huge, and people with no experience have a hard time finding the right tool. Team members used to pick their favourite tools and run the profiling and analysis in their local environment. That usually doesn't work well, because production-level traffic is hard to imitate locally, and many teammates' machines don't have enough memory to analyze a JVM heap dump. So I decided to find a way to standardize it. We tried third-party monitoring services that make profiling easy, but gave up on them because most of them run profiling all the time and consume a lot of disk space; all we wanted was ad-hoc profiling. After trying out a few tools, I settled on async-profiler, which works really well in production. I wrote a script to automate running the profiler and uploading its result (an SVG flame graph) to S3 so that everyone could view it easily. To show my teammates how easily this tool could be used in our system, I organized a live online profiling session: I ran the profiler against the production system and shared the URL of the result with everyone. Looking at the flame graph, most engineers could tell which code path was hurting our system, and they can now solve performance problems by themselves! I would say this was a real success, because my teammates have since been using this tool to solve other performance problems =)
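The automation described above can be sketched roughly like this. It assumes async-profiler's `profiler.sh` launcher is available on the host and the AWS CLI is configured; the bucket name, output path, and `dry_run` flag are my own illustration, not the actual script:

```python
import subprocess
import time

def profile_and_upload(pid, duration=60, bucket="jvm-profiles", dry_run=False):
    """Run async-profiler against a JVM process for `duration` seconds,
    then upload the resulting SVG flame graph to S3 so anyone can view it.
    With dry_run=True, just return the commands that would be executed."""
    svg = "/tmp/profile-{}-{}.svg".format(pid, int(time.time()))
    commands = [
        # async-profiler: -d duration in seconds, -f output file
        ["./profiler.sh", "-d", str(duration), "-f", svg, str(pid)],
        # hand the flame graph to S3 via the AWS CLI
        ["aws", "s3", "cp", svg, "s3://{}/{}".format(bucket, svg.rsplit("/", 1)[-1])],
    ]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)
    return commands
```

Keeping the whole workflow in one script is what made the live session possible: one command on the production host, and a shareable URL a minute later.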
