The SymPy/HackerRank DMCA Incident
Thank you to Ondřej Čertík, Oscar Benjamin, Travis Oliphant, and Pamela Chestek for reviewing this blog post. Any errors in the post are my own.
On April 19, 2022, for approximately 11.5 hours, the documentation website for the open source SymPy project and the corresponding GitHub repository were taken down as a result of a DMCA takedown notice submitted by WorthIT Solutions on behalf of HackerRank. In this post, I’d like to explain what the DMCA is and how it works, detail exactly what happened with this incident, and go over my views on the incident and some of the lessons we’ve learned as a result.
What is the DMCA?
The information in this section has been drawn from various sources, including:
- “Your DMCA Safe Harbor Questions Answered” Mitchell Zimmerman (2017)
- The EFF’s Unintended Consequences - 16 Years Under the DMCA (2014)
- GitHub’s DMCA Takedown Policy page
The Digital Millennium Copyright Act (DMCA) is a US law that was passed in 1998. It has laid the groundwork for how copyright is enforced in the age of the internet. One of the DMCA provisions that is important to understand is the so-called “safe harbor” provision. The safe harbor provision, roughly speaking, makes it so that websites can avoid liability because they are simply hosting or transmitting copyrighted content on behalf of their end-users.
The safe harbor provision has been absolutely essential to the modern internet. Without it, a website like GitHub could not exist. Imagine if every time someone found their copyrighted content had been uploaded to GitHub without their permission, that they could sue GitHub for damages. If this were the case, GitHub could never operate. The way GitHub works today, it cannot possibly know if every single thing that is uploaded to it is done so legally. The safe harbor provision also makes modern social media websites like Facebook, Twitter, and YouTube possible.
However, in order for a website to have and maintain safe harbor status, it must follow certain practices as outlined by the DMCA. In particular, sites must manage the “notice-and-take-down process”. GitHub’s notice-and-take-down process is described in its DMCA takedown policy.
The notice-and-take-down process works like this: a provider (like GitHub) has a means for a copyright holder to submit a notice. In GitHub’s case, they provide an online form. This notice must include certain information, including the name and address of the submitter, identification of the copyrighted work, the material on the site that is infringing copyright, and a statement of good faith, signed under penalty of perjury, that the person submitting the notice owns the copyright and believes their rights have been infringed upon. GitHub’s form also requires the submitter to assert that they have taken fair use into consideration.
Once the notice is submitted and once the provider confirms that the above things are done, they must “expeditiously” remove public access to the content. The DMCA does not define “expeditiously”.
The owner of the accused content then has the opportunity to file a counter-notice. GitHub’s counter notice procedure is described in its guide to submitting a DMCA counter notice. This gives the person whose content has been removed recourse if they believe the original notice was invalid. A counter notice contains a statement by the user that the content was removed “as a result of mistake or misidentification”, signed under penalty of perjury. When a provider receives a counter notice, they must forward it to the original complainant, and restore the challenged material after 10-14 business days. At this point, if the complainant wishes to keep the material down, they must file a lawsuit against the alleged infringer.
By following these procedures as outlined by the DMCA, GitHub avoids liability to either party.
GitHub goes beyond what the DMCA law requires. All notices and counter notices that have ever been submitted to GitHub are published publicly in the github/dmca repository (with personal information redacted). GitHub has also historically sided with developers in high profile infringement cases, such as the youtube-dl incident in 2020 (this incident involved a takedown relating to copyright circumvention, which is a different part of the DMCA law that I haven’t discussed here because it isn’t relevant to the notice that was sent to SymPy).
It’s important to understand, however, that GitHub’s hands are tied in many ways here. If they do not follow the notice and counter notice procedures exactly as outlined in the DMCA law, they risk losing their safe harbor status with the US Copyright Office. Were this to happen, it would be completely disastrous to GitHub as a business. Without safe harbor protection, GitHub could be liable for any copyrighted content that someone uploaded to its platform.
Finally, a technical note: when someone submits a takedown notice on a single part of a GitHub repository, GitHub’s response is to take down the entire repository. This is because it is technically impossible to remove only part of a repository due to the way git works. Completely scrubbing data from git history is possible, but it’s a destructive process that’s very disruptive to developers. Instead, GitHub requires the repository owners to scrub the data themselves, if they choose to, within one business day. Otherwise, they will take the entire repository down.
What happened
Below I provide a quick timeline of what happened. Times are in Mountain Daylight Time, which is where I live.
First, for background, the SymPy documentation website https://docs.sympy.org/ is hosted on GitHub Pages and mirrors the repository https://github.com/sympy/sympy_doc. It is common for websites to be hosted on GitHub Pages because it’s completely free and easy to set up for someone who is already used to using git and GitHub. This blog itself is hosted on GitHub Pages. The sympy_doc repository only contains the pre-built HTML of SymPy’s documentation. The actual source code lives in the main SymPy repository. However, the DMCA claims were only ever made against the website, so the main SymPy source code repository was unaffected.
-
Friday, Apr 15, 4:26 PM: Admins of the sympy/sympy_doc GitHub repository (myself, Ondřej Čertík, and Oscar Benjamin) received an automated email from support@githubsupport.com informing us that a DMCA takedown notice was filed against the SymPy documentation. Specifically, the claim, which can now be found on the github/dmca repository, claimed that “Someone has illegally copied our client's technical exams questions and answers from their official website https://www.hackerrank.com/ and uploaded them on their platform without permission.” It specifically referenced the page https://docs.sympy.org/latest/modules/solvers/solvers.html. The notice was made on behalf of HackerRank by an individual from “WorthIT Solutions Pvt. Ltd.” The notice also stated “the infringing website is not willing to remove our client's work”, which came as a surprise to us since, at no point in time prior to receiving this notice had we received any communications from HackerRank or WorthIT Solutions.
The support email stated that if we do not remove the offending content within one business day, the repository, and consequently the docs website, will be taken down.
While it is not relevant to the legality of the situation, it is worth noting that this was on Good Friday and the upcoming Sunday, April 17, was Easter Sunday, which limited our ability to effectively respond over the weekend.
-
Friday, April 15 - Tuesday, April 19: At first, we were not sure if the support email was legitimate or if it was just convincing spam. One thing that confused us was that the support email came from a domain, githubsupport.com, which did not appear to exist. Another issue is that although GitHub’s policy was to post all DMCA claims to the github/dmca repository, this particular claim had not yet been uploaded there. We reached out to GitHub support to ask if the email was legitimate, and they responded Monday, April 18, 3:18 PM that the email was indeed a legitimate email from GitHub Trust & Safety.
I had already reached out on Friday to NumFOCUS for assistance. NumFOCUS is a 501(c)(3) non-profit organization that represents many open source scientific projects including SymPy. However, due to the limited time frame of the notice and the fact that it occurred over a holiday weekend, NumFOCUS was not able to connect us with legal counsel until Tuesday.
-
Tuesday, April 19, 12:21 PM: We received notice from GitHub that the sympy/sympy_doc repository has been taken offline. The sympy/sympy_doc repository was replaced with a notice that the documentation was taken down by the DMCA notice with a link to the notice, which had been posted to the public github/dmca repository, and the documentation website itself started to 404. Since this was now public knowledge, we publicly announced this on the SymPy mailing list and Twitter.
We also worked with NumFOCUS and the legal counsel to send a counter notice to GitHub, since we believed that the notice was mistaken and that the claimed infringement, the code and mathematical examples on the docs page, were likely not even copyrightable.
Not long after the takedown occurred, someone posted it to Hacker News. The Hacker News posting eventually made its way to the front page and by the end of the day it had made its way to rank 3 on the site, where it received hundreds of upvotes and comments. The story also generated a lot of buzz on Twitter at this time.
-
Tuesday, April 19, 6:40 PM: Vivek Ravisankar, the CEO of HackerRank, posted a public apology on the Hacker News thread:
Hello, I'm Vivek, founder/CEO of HackerRank. Our intention with this initiative is to takedown plagiarized code snippets or solutions to company assessments. This was definitely an unintended consequence. We are looking into it ASAP and also going to do an RCA to ensure this doesn't happen again. Sorry, everyone!
-
Tuesday, April 19, 8:26 PM: Vivek posted a followup comment on Hacker News:
Hello again, Vivek, founder/CEO here. In the interest of moving swiftly, here are the actions we are going to take: (1) We have withdrawn the DMCA notice for sympy; Sent a note to senior leadership in Github to act on this quickly. (2) We have stopped the whole DMCA process for now and working on internal guidelines of what constitutes a real violation so that these kind of incidents don't happen. We are going to do this in-house (3) We are going to donate $25k to the sympy project. As a company we take a lot of pride in helping developers and it sucks to see this. I'm extremely sorry for what happened here.
HackerRank did follow through with the $25k donation to SymPy.
-
Wednesday, April 20, 12:00 AM (approximately): The SymPy documentation website and the sympy/sympy_doc repository went back online.
-
Wednesday, April 20, 2022, 10:11 AM: We received an email notice from GitHub that the DMCA claim against our repository has been retracted. The retraction is made available online on the github/dmca repository.
My Views
Now that I’ve stated the facts, let me take a moment to give my own thoughts on this whole thing. My belief is that the DMCA claim that was made against the SymPy repository was completely baseless and without merit. Not only is no part of the SymPy documentation, to my knowledge, taken illegally from HackerRank or any other copyrighted source, but the claim itself was completely unspecific about which parts of the documentation were supposedly infringing. This made it impossible for us to comply even if the complaint was legitimate. In addition, it is questionable whether the supposed parts of the documentation that were taken from HackerRank, the mathematical and code examples, are even copyrightable. While this may not have been an actively targeted attack against SymPy, as a community run open source project, it served as one in practice.
We have never learned any more about which parts of the SymPy documentation were supposedly infringing on HackerRank’s copyright, and I expect we never will. Some people online have speculated that HackerRank actually took examples from the SymPy documentation, not the other way around, but I do not know if this is the case or not. It seems likely, given the large number of other takedown notices WorthIT has made to GitHub on HackerRank’s behalf, that they were using some sort of automated or semi-automated detection system, which somehow detected what it thought was infringing content on our website. I have personally never used the HackerRank platform, so I only know what sorts of things are on it from second-hand accounts.
Firstly let me say that if anyone or anything is to be blamed for this, my personal belief is that the blame should primarily lie on the DMCA law itself. While the “safe harbor” provisions of the law have been critical to the development of the modern internet, allowing social websites such as GitHub to exist without the burden of legal liability for user contributed content, other provisions are hostile to those same users, even those who are operating in a completely legal manner. The DMCA works in a “guilty until proven innocent” way, and the very operation of the takedown notice policy implicitly assumes that those making claims are not abusing the system as bad actors. It also incentivizes platforms such as GitHub to step aside and not take sides in claims, even claims such as this one, which refer to likely uncopyrightable material or are too unspecific to reasonably address even if they are legitimate.
The DMCA also has had further chilling effects due to its anti-circumvention provisions. The EFF has written extensively about these. While something as radical as abolishing copyright may be considered to be practically untenable, I believe that laws like the DMCA can be rewritten so that they still serve their intended function of protecting copyright owners and providing “safe harbor” to websites like GitHub, while also protecting the rights of those accused of infringing copyright.
With that being said, I think some blame does need to lie on HackerRank and WorthIT Solutions for abusing the DMCA system and filing claims like this. They also lied in their claim when they said “the infringing website is not willing to remove our client's work.” At no point were we contacted by HackerRank or WorthIT Solutions about any copyright infringement. If we were, we would have been happy to work with them to determine whether any content in SymPy is infringing on copyright and remove it if it was. Even if it were the case that SymPy’s documentation infringed on HackerRank’s copyright, the immediate escalation to a DMCA takedown notice, which legally forced GitHub to completely disable the entire SymPy documentation, was completely inappropriate. Even submitting a counter notice is risky, because by doing so it increases the likelihood that the complaining party will bring an infringement lawsuit. Many DMCA’d repositories do not submit a counter notice for this reason, and we were only able to do so comfortably because we were able to do so via NumFOCUS. Had HackerRank not retracted the notice, we would have had to wait for our counter notice to take effect, meaning our docs website would have remained unavailable for 10-14 days. And had they instead taken down the main SymPy repository instead of the sympy_doc repo where the docs are hosted, this would have been a major disruption for our entire development workflow.
I do want to thank Vivek for quickly retracting the notice once this came to his attention, and I hope that he will be rethinking their DMCA policies at HackerRank, and for their donation to support SymPy and NumFOCUS.
Finally, while many have blamed GitHub, I believe that for the most part, they have acted as they are effectively required to by law. It is important to understand that any similar website based in the US would have likely treated this claim in a similar way. For instance, here is GitLab’s DMCA policy, which you can see is very similar to GitHub’s. By all accounts, GitHub is an industry leader here, by posting all notices and counter notices online, which they are not required to do, and by being very open about their DMCA policies ([1], [2], [3]). With that being said, there are ways that GitHub’s processes could be improved here. The one business day that we were given to respond is actually standard in the industry, and ultimately comes from the “expeditiously” wording of the DMCA. But even so, if we were given more time than this, it would have made things much easier. It may have even been possible for us to reach out to HackerRank and resolve the situation before any actual takedown occurred. It also would be helpful if GitHub, while staying within the DMCA provisions, had a higher standard of legitimacy for takedown notices such as the one we received. As I have noted, the request we received was not nearly specific enough for us to effectively act on it, which should have been apparent to GitHub, since the only information WorthIT provided was a link to the HackerRank home page. Finally, I have two small technical requests from GitHub: first is to make githubsupport.com a real website so that support emails appear more legitimate, and second is to post DMCA notices to the DMCA repository right away rather than only after the repository is taken down, to make it clearer that the notices are in fact legitimate.
Open source communities such as SymPy are under-resourced and typically ill equipped to handle situations like these, which inherently puts them in an inferior position. This is despite the fact that open source software serves a global commons that benefits all. I have estimated that SymPy itself is used by hundreds of thousands of people, and the broader scientific Python ecosystem that it is part of is used by millions of people. Abuses of copyright law against these communities are harmful to society as a whole.
Lessons Learned
I’d like to end with a few lessons that I’ve learned from this whole process.
- Take DMCA takedown notices seriously. If you ever receive a DMCA takedown notice from GitHub or any other website, you need to take it seriously. The website is required by law to remove your content after they receive such a notice. You can submit a counter notice, but this increases the likelihood of being sued.
- NumFOCUS fiscal sponsorship is invaluable. Our NumFOCUS fiscal sponsorship was invaluable here. Thanks to NumFOCUS, we were able to immediately access high quality legal advice on how to proceed here. Had HackerRank not retracted their notice, or even, heaven forbid, decided to sue, we would have had significant assistance and support from NumFOCUS. If you are part of an open source project that doesn’t have fiscal sponsorship, I would recommend looking into NumFOCUS, or other similar organizations. And if you want to support NumFOCUS, I encourage you to do so.
- Make backups of your online data. While the source code of any GitHub repository is effectively backed up onto every computer that has a git clone. Other data such as issues and pull request comments are not. We were lucky that the DMCA notice was sent against our documentation repository instead of our main repository. If our main repository were taken down instead, we would have lost access to all GitHub issue and pull request history. I have been looking into effective ways to backup this data. If anyone has any suggestions here, please let me know in the comments.
I do feel that being on a site like GitHub is still preferable to something like self-hosting. If you self-host, you become responsible for a ton of things which GitHub does for you, like making sure everything stays online, handling servers, and managing spam. Also, self-hosting does not magically shield you from legal threats. If you self-host content that infringes on someone’s copyright, you are still legally liable for hosting that content.
Finally, I want to thank everyone who supported SymPy during this incident. Even if you just talked about this on social media or upvoted the Hacker News story, that helped us get this into the public eye, which led to a much faster resolution than we expected. I especially want to thank Leah Silen and Arliss Collins from NumFOCUS; Ondřej Čertík, Oscar Benjamin, and other SymPy community members; Pamela Chestek, the NumFOCUS legal counsel; and Travis Oliphant for the help they provided to us. Thank you to Thomas Dohmke, GitHub’s CEO, who we have reached out to privately, and who has promised to improve GitHub’s DMCA policies. I also want to thank Vivek Ravisankar from HackerRank for retracting the DMCA claim, and for the generous donation to SymPy.
Now that this incident is over, I’m hopeful we in the SymPy community can all go back to building software.
If you wish to donate to support SymPy, you may do so here.