Status Report: Mechanisms for Researcher Access to Online Platform Data

Date Draft Released: April 5th, 2024

Academic and civil society research on prominent online platforms has become a crucial way to understand the information environment and its impact on our societies. Scholars across the globe have leveraged application programming interfaces (APIs) and web crawlers to collect public user-generated content and advertising content on online platforms to study societal issues, ranging from technology-facilitated gender-based violence to the impact of media on the mental health of children and youth. Yet a changing landscape of platforms’ data access mechanisms and policies has created uncertainty and difficulty for critical research projects.

The United States and the European Union have a shared commitment to advance data access for researchers, in line with the high-level principles on access to data from online platforms for researchers announced at the EU-U.S. Trade and Technology Council (TTC) Ministerial Meeting in May 2023.1 Since the launch of the TTC, the EU Digital Services Act (DSA) has gone into effect, requiring providers of Very Large Online Platforms (VLOPs) and Very Large Online Search Engines (VLOSEs) to provide increased transparency into their services. The DSA includes provisions on transparency reports, terms and conditions, and explanations for content moderation decisions. Among those, two provisions provide important access to publicly available content on platforms:

• DSA Article 40.12 requires providers of VLOPs/VLOSEs to provide academic and civil society researchers with data that is “publicly accessible in their online interface.”
• DSA Article 39 requires providers of VLOPs/VLOSEs to maintain a public repository of advertisements.

The announcements related to new researcher access mechanisms mark an important development and opportunity to better understand the information environment.
This status report summarizes a subset of mechanisms that are available to European and/or United States researchers today, following, in part, VLOPs’ and VLOSEs’ measures to comply with the DSA. The report aims to showcase the existing access modalities and to encourage the use of these mechanisms to study the impact of online platforms’ design and decisions on society. The list of mechanisms reviewed is included in the Appendix. This technical report is intended to facilitate further discussion on the topic during the technical workshops “Opening-up Platforms’ Black Boxes” and “From Data to Solutions against Technology-Facilitated Gender-Based Violence” held by the government of the United States and the European Commission at the TTC Ministerial Meeting in Leuven on April 4, 2024. The content of this report and its annexes is based on public information made available by service providers and builds on the work carried out by U.S. and EU researchers.2 The analysis presented in this document does not necessarily represent the official position of the European Commission or the United States Government.

1 Mechanisms for Providing Publicly Accessible Data

• Platforms are taking different approaches to providing researchers access to public content, including application programming interfaces (APIs) and permission to scrape public content.
• The level of detail regarding what data is available in each platform’s mechanism varies. A few platforms include data dictionaries and documentation to accompany their public content access mechanism; others simply say their access mechanism makes “public data” available.
• Many platforms require researchers to complete an application prior to receiving access to their public content access mechanism. The applications require researchers to share project details, such as the research questions and timeline, along with data protection plans.
• Some platforms have explicitly stated they will grant access to researchers outside the European Union.
• Platforms often require researchers to accept terms which include provisions related to required data refreshes, prepublication review, open access publishing, and data management.

1.1 Methods of Access

Platforms are taking different approaches to providing researchers access to public content, referred to here as public content access mechanisms (these are individually described in the Appendix). Most platforms require researchers to fill out an application prior to accessing these mechanisms. In some cases, the application grants the researcher free access to the platform’s already existing commercial API (examples: YouTube Researcher Program, X (formerly Twitter) API, and Reddit API), while other platforms have rolled out new APIs specifically for researchers (examples: TikTok Research API, Meta Content Library and API).3 The LinkedIn Researcher Access and Bing Qualified Researchers Program applications suggest that once a researcher applies, they may receive data through an API “as applicable.”4 Additionally, some platforms are choosing to allow researchers to scrape or write code to crawl the web interface of an online platform and collate data. For example, Google Request Records allows researchers to apply for permission to partake in “limited scraping” of Google Shopping, Google Play, and YouTube content.5 Wikipedia and Booking.com include language in their terms of service regarding when people can scrape their platforms for non-commercial purposes (no application required).6

1.2 Data Availability

The data available in each public content mechanism varies based on domain and types of user-generated content. The public data available across platforms and search engines varies in content, volume of personal data, and structure, and also depends on the specificities of each service.
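Where scraping is permitted, researchers typically still check a site's robots.txt policy before crawling. A minimal sketch of such a courtesy check is below; the function name and sample policy are illustrative, not any platform's documented procedure, and platform terms (not robots.txt alone) govern whether research scraping is allowed.

```python
from urllib import robotparser


def may_scrape(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt policy permits fetching `url`.

    `robots_txt` is the raw text of a site's robots.txt file. This is only a
    technical courtesy check; written permission under a platform's terms of
    service is a separate, legal question.
    """
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse policy without a network call
    return parser.can_fetch(user_agent, url)
```

In practice a research crawler would fetch robots.txt once per host, cache the parsed policy, and throttle its request rate in addition to honoring disallow rules.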
For mechanisms that provision access via unlimited scraping (e.g., Booking.com, Wikipedia), whatever information is on the surface of the webpage and can be retrieved by a web crawler is available. Data provided through limited scraping or an API is bound by what the platform makes available. Some platforms have included public documentation to accompany their mechanism that describes what is available (see Appendix). Thus far, platforms are making different choices about what to include. For example, some platforms may include data regarding a page or channel name change, or the subject category associated with a post, while others may not.

Regarding the geographical location or citizenship of the data subjects included in the public content mechanisms, a few mechanisms state that the data will be from platform users around the world. For example, the YouTube Researcher Program states that researchers have access to the “entire public YouTube corpus.”7 LinkedIn Researcher Access and the Bing Qualified Researcher Program are not explicit in their terms but include questions on their researcher access applications such as “In which country do you expect to store the requested public data?” accompanied by a picklist including every country as an option, suggesting researchers can apply for public content sourced globally.8 Similarly, the Meta Content Library and API application includes a drop-down menu for “research region of interest” and “primary research country of interest.”9

1.3 Rate Limits/Quotas

Some platforms dictate the volume of public data researchers can access through rate limits and quotas. For example, the TikTok Research API has a daily limit of “1,000 requests per day” across its APIs, allowing up to 100,000 records per day.10 The YouTube Researcher API also sets a quota of 10,000 units per day (units vary based on type of data), but researchers can apply for a quota extension.11 A few platforms, such as Reddit and Google Request Records, note there may be limits at the platform’s “sole discretion.”12

1.4 Conditions for Access

As mentioned above, many platforms are conditioning access on an application process. While each application has a unique set of questions, they typically ask about the researcher(s) and research institution(s), the research proposal and data needs, a description of research funding (proof of independence), a description of data protection measures, and an assertion that the researcher agrees to terms of use and privacy policies. With the exception of the Meta Content Library and API, decisions about which researchers get data access fall to the platform. Meta has chosen to partner with the Inter-university Consortium for Political and Social Research (ICPSR) at the University of Michigan to vet applications and assist in onboarding to the Meta Content Library and API.13 In this scenario, ICPSR will notify Meta and the applicant if the application is approved. However, access to Meta’s Content Library and API is still contingent upon the fulfilment of contractual obligations that apply both to the individual researcher and to the researcher’s affiliated institution.14 A deeper look at the application questions and stated criteria provides insight into what research may come from these mechanisms.

1.4.1 Geographic Location of Researchers

Platforms are determining the eligibility of researchers/research organizations based on geographical location.
For example, Meta’s Content Library and API is available to researchers globally,15 whereas TikTok’s Research API is available to researchers in the EU and U.S.16 The Google Request Records description states, “This program is currently only available for researchers based in the European Union (EU) and may be expanded upon in the future.”17 Many of the programs listed in the Appendix do not explicitly state where researchers/research institutions must be located and instead say that their applications will be considered “in accordance with the Digital Services Act (DSA).”18 AliExpress Open Research & Transparency, LinkedIn Researcher Access, YouTube Researcher Program, Pinterest Researchers Intake, and the Bing Qualified Research Program include a question on their application asking for the researcher’s country, suggesting they may accept applications from around the world.19

1.4.2 Affiliation and Qualifications

Most mechanisms allow researchers from a range of non-commercial institutions to apply for access.
Several applications only ask for the name of an applicant’s organisation (examples: LinkedIn Researcher Access, AliExpress Open Research & Transparency, Bing Qualified Research Program, Snap Researcher Data Access).20 Google Request Records accepts researchers that are affiliated with “not-for-profit bodies, organisations and associations,” and its application form makes a distinction between a) academic institutions, b) non-profit organization/charity/NGO, c) government-affiliated research organization, d) independent research organization, and e) “other.”21 The X (formerly Twitter) API application similarly asks researchers to “describe your organization’s affiliation,” explicitly mentioning NGOs as an example group, and further requests information about an organization’s “not-for-profit status.”22 Meta’s Content Library and API accepts applications from researchers “affiliated with an academic institution or other non-university organization, institute, or society that operates as a not-for-profit entity and holds scientific or public interest research as a primary purpose or core activity."23 The TikTok Research API and YouTube Researcher Program focus on applications from academic institutions, but the YouTube Researcher Program adds that “qualified institutions” can also include “any government or other institution required by law or regulation to have access to program data.”24 The applications also vary in how much information they require regarding the researcher’s qualifications and experience.
Some applications do not include any questions about the researcher’s professional background (examples: Snap Researcher Data Access, Reddit Researcher Access Request, Pinterest Researchers Intake), while the Bing Qualified Research Program and LinkedIn Researcher Access request a link to the researcher’s LinkedIn profile.25 AliExpress Open Research & Transparency requests a CV and list of past publications, and the TikTok Research API asks applicants to “provide links to your past publications or attended conferences."26 Similarly, Meta’s Content Library and API application requires the lead researcher to provide evidence of skills in coding or querying languages, suggesting a GitHub repository, and “up to 3 citations or examples of your research that demonstrate your experience using sensitive data.”27

1.4.3 Funding/Independence

To ensure projects are undertaken for research and not commercial purposes, applications include questions about funding and conflicts of interest. Several applications explicitly ask who is funding the research (examples: AliExpress Open Research & Transparency, Google Request Records, Meta Content Library, Snap Researcher Data Access, LinkedIn Researcher Access, Pinterest Researchers Intake).28 Additionally, many include questions along the lines of “Are you and your organisation independent from commercial interests? Otherwise, please specify any such interest.”29 Google Request Records asks the researcher to “provide evidence that you/your organization is independent from commercial interests,” and offers a list of evidence such as a copy of organisation bylaws or tax-exemption certification.30 The X (formerly Twitter) API application explicitly asks for any relevant information related to an organisation’s board members, shareholders, or grant recipients.31 The Product Terms for Meta Research Tools ask that researchers “promptly disclose to Meta any conflicts of interest that currently exist or may arise with respect to the research performed in connection with the research tools or product terms.”32 The TikTok Research API terms explicitly require that a researcher be free from any affiliation with the platform, including through its parent company or a subsidiary, for instance as an intern, freelancer, vendor, or consultant.33

1.4.4 Research Proposal

All the applications ask researchers to describe their research purpose, but the level of detail varies. The applications ask researchers about the topic areas they are covering. Some platforms include a yes/no question along the lines of “Your planned research activities will be carried out for the purposes laid down in article 40(4) DSA, and the expected results of the research will also contribute to such purposes.”34 Others ask an open-ended question regarding how the researcher’s topics relate to systemic risks as defined by the DSA (examples: Bing Qualified Researcher Program, LinkedIn Researcher Access, X (formerly Twitter) API).35 The TikTok Research API and Google Request Records ask applicants to select which “research category” their proposal falls under, and the categories allude to systemic risks.36 The Snap Researcher Data Access and Reddit Researcher Access Request applications have an open-ended question for the research topic description and do not include a request to tie the project to systemic risks.37 Nearly every application asks the researcher to describe their research design and methodology.
The level of detail ranges from the LinkedIn Researcher Access application’s 250-character text box for a description of a project proposal that includes “a research problem to be solved, data to be used, project timeline, key milestones, and outcomes,”38 to the Google Request Records application’s 5,000-character text box for similar information.39 The TikTok Research API goes beyond asking for research questions and asks for an “explanation of your hypotheses and context from the current literature,” including a summary of the literature review.40 Within the research design questions, researchers are asked to describe the data they wish to access and why, often in an open text box. The Reddit Researcher Access Request asks for the specific subreddits under consideration,41 while the Meta Content Library and API requires a “list of keywords associated with the research project.”42 The X (formerly Twitter) API also asks applicants to describe why the data is needed and “not otherwise currently accessible to you by other sufficient means.”43 Many platforms ask for the length of time data is needed (examples: AliExpress Open Research & Transparency, LinkedIn Researcher Access, Bing Qualified Researcher Program, Pinterest),44 and the X (formerly Twitter) API asks for “a detailed justification for this timeframe.”45 Two providers impose time limits on data access: the TikTok Research API application highlights that the longest period one can apply for access is two years,46 whereas the Google Request Records application states that “data access and related permissions will be granted until the research end date or a maximum of 9 months at a time – and can be renewed if required.”47 Snap Researcher Data Access simply asks for “details on the timeframe of the requested data.”48

1.4.5 Data Protection

Many applications inquire about the researcher’s plans to protect data.
Applications include questions such as “Describe the technical and organizational measures you have put in place to protect personal data and ensure data security and confidentiality.”49 The Google Request Records application offers the researcher suggestions such as a “strong password policy” and “access logging.”50 The LinkedIn Researcher Access and Bing Qualified Researcher Program applications include a question inquiring who will be responsible for GDPR compliance.51 The Meta Content Library and API application asks researchers to describe the ethical guidelines they plan to consider, and the example ethical guidance includes respecting data subject privacy.52 Some access mechanisms require the researcher to submit certification of data protection policies. The Google Request Records application asks researchers to submit evidence such as “GDPR certification, an evaluation report from an independent assessor, or any other appropriate evidence” that demonstrates that the researcher’s host organisation is capable of fulfilling applicable data security and confidentiality requirements.53 The X (formerly Twitter) API encourages applicants to describe any “compliance certifications” that demonstrate compliance with data security practices.54 The TikTok Research API application presents a “compliance certificate” that an applicant needs to agree to before data can be shared.55

1.5 Agreement to Terms

At the end of most applications there are requirements to agree to the platform’s terms, ranging from developer policies to data protection requirements to general terms of use.
The Bing Qualified Researcher Program and Pinterest Researchers Intake state that they provide terms after a researcher completes the application,56 and Snap Researcher Data Access allows researchers to submit an application without agreeing to terms.57 These terms can impact research.

1.5.1 Data Refresh

Platforms are taking different approaches to handling data that users have posted publicly and then later modified or deleted. A few platforms require deleted data to be removed from the researcher’s dataset within a short time period through a data refresh. For example, TikTok’s terms state, “you agree to regularly refresh TikTok Research API Data at least every fifteen (15) days, and delete data that is not available from the TikTok Research API at the time of each refresh.”58 The X (formerly Twitter) API directs researchers to make “reasonable efforts” to delete content that has been deleted by the user within 24 hours.59 The YouTube Researcher Program API similarly directs the researcher to “regularly refresh program data…until such time as you need the program data to be fixed as to a point in time for the purposes of finalizing your analysis.”60 Meta’s Content Library and API is unique in that the responsibility to remove deleted content from the dataset does not fall to the researcher. Once a researcher completes an API pull, the data remains in a virtual data enclave; the secure environment is then refreshed every 30 days.61

1.5.2 Prepublication Review/Requirements

A handful of access mechanisms require researchers to provide platforms with a copy of research before it is public. The YouTube Researcher Program requests that the researcher “use reasonable efforts” to provide YouTube with a copy of each publication at least seven (7) days before its publication. This is meant solely as a “courtesy notice to YouTube. YouTube will not have editorial discretion or input in any researcher publication.”62 The TikTok Researcher API requires researchers to provide TikTok with a copy of any publications pertaining to or containing the results and findings of the research outputs, and any supporting information, at least seven (7) days before publication, primarily to identify any private user personal data that may need to be removed prior to publication or disclosure.63 The Meta Content Library and API terms request that researchers submit “sufficiently ahead of the planned publication or disclosure date (and in any event, at least thirty (30) days ahead of such date)” in order for Meta to have the opportunity to “review drafts of any publications pertaining to or containing the results and findings of the Research Projects” for the purpose of identifying “any confidential information or any personal data included or revealed in those materials.”64 LinkedIn’s terms ask researchers to “agree to use reasonable efforts to provide LinkedIn with access to a courtesy copy of any research work product at least five (5) business days before access is granted to any third-party research end user.”65

1.5.3 Open Access

Some platforms encourage researchers to publish their research in open access journals, which are free to the public. The YouTube Researcher Program, Meta Content Library and API, and TikTok Research API all include statements in their terms directing researchers to “use reasonable efforts” to publish all researcher publications in open access journals or publications and/or as open access resources on other websites.66

1.5.4 Additional Data Management Provisions

Beyond asking researchers questions about their data protection plans, the data access mechanisms’ terms of use often include additional stipulations regarding data treatment.
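One recurring data-management stipulation is the refresh obligation described in section 1.5.1: periodically re-query the access mechanism and delete any locally held records the platform no longer returns. A minimal sketch of that bookkeeping is below; `fetch_current_ids` is a hypothetical stand-in for a platform query, not a documented call of any research API.

```python
def refresh_dataset(local_records, fetch_current_ids):
    """Drop locally stored records that are no longer available upstream.

    `local_records` maps record IDs to stored records. `fetch_current_ids`
    is a caller-supplied function (an illustrative assumption, not a real
    platform endpoint) returning the IDs still publicly available via the
    research access mechanism.
    """
    available = set(fetch_current_ids())
    # Keep only records the platform still serves; everything else is purged.
    kept = {rid: rec for rid, rec in local_records.items() if rid in available}
    removed = sorted(set(local_records) - available)
    return kept, removed
```

A research pipeline would schedule this inside whatever refresh window the platform's terms impose (e.g., every fifteen days under TikTok's terms) and log `removed` for audit purposes without retaining the deleted content itself.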
TikTok’s terms on global data sharing state that a researcher needs to “implement technical and organisational security measures to protect the personal data it receives,” and those measures shall include 43 “minimum security measures.” These measures are organised in the following categories: (i) system entry control, (ii) physical access controls, (iii) data access control, (iv) data transfer control, (v) input control, and (vi) availability control.67 Meta has a data protection chapter in its Product Terms for Meta Research Tools, which requires researchers to maintain “an appropriate level of security for the Meta data and to protect the Meta data against accidental or unlawful destruction, loss, alteration, unauthorized disclosure of, or access to the Meta data,” listing a set of requirements.68 X (formerly Twitter) sets out that API users shall not “use, or knowingly display…information derived from X content” to segment based on “[…] political affiliation or beliefs, […] religious or philosophical affiliation or beliefs, […] sex life or sexual orientation.”69 Google Request Records further specifies in its Acceptable Use Policy that researchers and their organisations must (i) “have and comply with a written privacy policy that clearly and accurately describes (a) the program data that you will access, collect, and store, and (b) how and why you will use and process that program data,” and (ii) “have and comply with reasonable and appropriate administrative, organizational, technical, and physical controls designed to protect program data against accidental, unauthorized, or unlawful destruction or accidental loss, alteration or destruction, and to protect against any unauthorized disclosure or access. These controls must be documented and must provide a level of security appropriate to the risk represented by the process and nature of the data to be protected.”70

2 Advertisement Repositories

• Many platforms providing insights into advertising content are providing both a searchable web interface and directions for accessing advertisements via an API.
• The ad repositories vary in data availability but largely include an advertiser ID, ad content, delivery dates, and information describing how the ad is targeted.
• The ad repositories listed in the Appendix are available to anyone in the world with access to the online platform’s website, but in most cases, detailed impression data is only available for ads served in Europe.

2.1 Methods of Access

Many platforms providing insights into commercial content are providing both a searchable web interface and directions for accessing advertisements via an API. Some of the APIs require the user (researcher) to set up a developer account and sign developer terms (examples: Google Ads Transparency Center, LinkedIn Ad Library, Meta Ad Library, TikTok Commercial Content API, X (formerly Twitter) Ads Repository).71 As described above, these terms can include rate limits and other obligations for their users.

2.2 Conditions for Access

The platforms providing ad repositories have made the data available to the public with no requirement to be a researcher. Only TikTok and LinkedIn require users of their ad repository to apply for API access. LinkedIn’s application requires basic information about the user’s application.72 TikTok’s Commercial Content Library application requires a description of the researcher’s area of expertise and a detailed description of the project, similar to TikTok’s Researcher API.73

2.3 Data Availability

The ad repositories vary in data availability, but largely include an advertiser ID, ad content, delivery dates, and information describing how the ad is targeted. The context of the platform can lead to different additional fields.
For example, the Apple Ad Repository API has a parameter called “placement” which describes where the ad was placed in the app store.74 At the same time, some ad repositories offer details on the number of impressions of a given ad, whereas others do not (e.g., the Apple Ad Repository does not offer the number of impressions of a given advertisement),75 and some group the advertisements served on different services in a single repository (this applies, for example, to Google Search and YouTube advertisements).76 The ad repositories listed in the Appendix are available to anyone in the world with access to the online platform’s website. The contents of the repositories often only include ads that were served in the EU/EEA (examples: Apple Ad Repository, X Ad Repository, Microsoft Ad Library, Pinterest Ad Repository, Booking.com, TikTok Ad Library).77 The Google Ads Transparency Center includes ads from across the globe, but ads served in Europe are accompanied by additional data such as targeting information and the total number of recipients within each country.78 The Meta Ad Library similarly includes ads from across the globe, but the information included varies based on where the ad was served. For example, in the US, “issues, election, or politics” ads have additional data such as total amount, total impressions received, and demographic information on ad reach (age, gender, location).79

Appendix

The information presented in these tables is based on public information made available by service providers and builds on the work carried out by U.S. and EU researchers. The analysis presented in this document does not necessarily represent the official position of the European Commission or the United States Government.

Each entry lists the mechanism, its interface, who has access, whether an application is required, and whether a data dictionary or documentation is available.

AliExpress Open Research & Transparency
Interface: The mechanism description URL suggests an API.
Who has access: Academic and civil society researchers as described in DSA Article 40(12). Application includes “Country where you are established/affiliated,” suggesting researchers from other countries can apply.
Application required: Yes. Data dictionary/documentation available: No.

Booking.com Scraping Provision1
Interface: Scraping.
Who has access: Anyone scraping for non-commercial purposes.
Application required: No. Data dictionary/documentation available: No.

Bing Qualified Researcher Program
Interface: API “as applicable” (suggesting other methods may be available).
Who has access: Academic and civil society researchers as described in DSA Article 40(12). Application includes a text box “Organization Country,” suggesting researchers from other countries can apply.
Application required: Yes. Data dictionary/documentation available: No.

Google Request Records
Interface: Varies. Maps: access to public data through a cloud-based solution; Play: permission for limited scraping; Search: access to an API for limited scraping with a budget for quota; Shopping: permission for limited scraping; YouTube: permission for limited scraping.
Who has access: Researchers affiliated with EU-based organizations.
Application required: Yes. Data dictionary/documentation available: No.

LinkedIn Researcher Access
Interface: API “as applicable” (suggesting other methods may be available).
Who has access: Academic and civil society researchers as described in DSA Article 40(12). Application also includes a question (“organization country”) which lists countries around the world, suggesting researchers from other countries can apply.
Application required: Yes. Data dictionary/documentation available: No.

Meta Content Library and API
Interface: Searchable user interface and API provided through a virtual data enclave.
Who has access: “To be eligible for product access, researchers must either be affiliated with an academic institution or other non-university organization, institute, or society which operates as a not-for-profit entity and holds scientific or public interest research as a primary purpose or core activity.”
Application required: Yes. Data dictionary/documentation available: Yes.

Pinterest Researchers Intake
Interface: Not specified.
Who has access: “If you’re a researcher”
Application required: Yes. Data dictionary/documentation available: No.

Reddit Researcher Access Request2
Interface: Commercial API.
Who has access: Researchers accessing data for non-commercial purposes can apply for free access to the commercial API; the application includes: “From what country or countries will you access the Reddit API?”
Application required: Yes. Data dictionary/documentation available: Yes.

Snap Researcher Data Access
Interface: Not specified.
Who has access: Requests are “in accordance with the Digital Services Act (DSA)”.
Application required: Yes. Data dictionary/documentation available: No.

TikTok Research API
Interface: API.
Who has access: Researchers from the US and Europe.
Application required: Yes. Data dictionary/documentation available: Yes.

Wikipedia Tools3
Interface: Scraping and a set of APIs.
Who has access: Public.
Application required: No. Data dictionary/documentation available: Yes.

X (formerly Twitter) API
Interface: Commercial API.
Who has access: Different levels of access on the basis of fees, and free access for researchers studying a “narrow subset of EU research related to the DSA”.
Application required: Yes. Data dictionary/documentation available: Yes.

YouTube Researcher Program
Interface: Commercial API.
Who has access: Must be affiliated with an “eligible academic institution” from a list of countries that extends beyond the US and Europe.
Application required: Yes. Data dictionary/documentation available: Yes.

Ad Repositories

Apple Ad Repository and API
Interface: Searchable web interface and API.
Who has access: Public. Application required: No. Data dictionary/documentation available: Yes.

Booking Ads Repository and API
Interface: Searchable web interface and API.
Who has access: Public. Application required: No. Data dictionary/documentation available: Yes.

Google Ads Transparency Center4
Interface: Searchable web interface and API.
Who has access: Public. Application required: No. Data dictionary/documentation available: Yes.

LinkedIn Ad Library
Interface: Searchable web interface and API.
Who has access: Public. Application required: Yes. Data dictionary/documentation available: Yes.

Meta Ad Library
Interface: Searchable web interface and API.
Who has access: Public. Application required: No. Data dictionary/documentation available: Yes.

Microsoft Ad Library
Interface: Searchable web interface and API.
Who has access: Public. Application required: No. Data dictionary/documentation available: Yes.

Pinterest Ads Repository
Interface: Searchable web interface.
Who has access: Public. Application required: No. Data dictionary/documentation available: No.

Snap Ads Gallery
Interface: Searchable web interface.
Who has access: Public. Application required: No. Data dictionary/documentation available: No.

Snap Political Ads Library
Interface: CSV files.
Who has access: Public. Application required: No. Data dictionary/documentation available: Yes.

TikTok Ad Library5
Interface: Searchable web interface and API.
Who has access: Public. Application required: Yes. Data dictionary/documentation available: Yes.

X (formerly Twitter) Ads Repository
Interface: Searchable web interface.
Who has access: Public. Application required: No. Data dictionary/documentation available: Yes.

1 Booking.com, Customer terms of service, https://www.booking.com/content/terms.html [accessed March 26, 2024]. Scraping is only banned for commercial purposes: “You’re not allowed to monitor, copy, scrape/crawl, download, reproduce, or otherwise use anything on our Platform for any commercial purpose without written permission of Booking.com or its licensors.”
2 Researchers apply by selecting “API support and inquiries” under “What do you need assistance with?”, then under “From what position are you reaching out for support” selecting “I’m a researcher”, then under “What is your inquiry?” selecting “I’m a researcher and want to sign up to use the Reddit and/or Reddit Data” at Reddit, Submit a request, https://support.reddithelp.com/hc/en-us/requests/new?ticket_form_id=14868593862164 [accessed March 25, 2024].
3 Wikimedia Foundation, Wikimedia Foundation Terms of Use, https://foundation.wikimedia.org/wiki/Policy:Terms_of_Use/en [accessed March 25, 2024]: “Re-use: Reuse of content that we host is welcome, though exceptions exist for content contributed under "fair use" or similar exemptions under applicable copyright law. Any reuse must comply with the underlying license(s).”
4 API available through BigQuery Public Data (marketplace/details/bigquery-public-data/google-ads-transparency-center).
5 The API with similar information to the ad library searchable interface is accessed via the Commercial Content API (https://developers.tiktok.com/doc/commercial-content-api-get-ad-details/).

Endnotes

1 European Commission, Transparent and Accountable Online Platforms, May 26, 2023, https://digital-strategy.ec.europa.eu/en/library/transparent-and-accountable-online-platforms
2 In particular by The George Washington University Institute for Data Democracy & Politics (Platform Transparency Tools Tracker), and Dr. Mathias Vermeulen (AWO) in his capacity as independent expert supporting the European Commission’s work on data access for researchers.
3 See Appendix.
4 See LinkedIn, Researcher Access, LinkedIn Help, https://www.linkedin.com/help/linkedin/answer/a1645616? [accessed March 25, 2024]. See also Bing Qualified Researcher Program (Microsoft Office Form) [accessed March 25, 2024].
5 Google, Google Research Program, Google Request Records, https://requestrecords.google.com/researcher [accessed March 25, 2024].
6 See specific clauses in Appendix footnotes.
7 YouTube, How it Works, https://research.youtube.com/how-it-works/ [accessed March 25, 2024].
8 See LinkedIn, Research Access, LinkedIn Help, https://www.linkedin.com/help/linkedin/answer/a1645616 [accessed March 25, 2024]. See also Bing Qualified Researcher Program (Microsoft Office Form) [accessed March 25, 2024], which does not include a picklist but an open text box for country.
9 See under "Research Project Information" on SOMAR InfoReady Application Guide for Meta Content Library and Content Library API (Google Doc) [accessed March 25, 2024].
10 TikTok for developers, Frequently Asked Questions, https://developers.tiktok.com/doc/research-api-faq/ [accessed March 26, 2024].
11 Example from YouTube Data API, Quota and Compliance Audits, https://developers.google.com/youtube/v3/guides/quota_and_compliance_audits [accessed March 26, 2024].
12 See Reddit, Data API Terms, last revised April 18, 2023, https://www.redditinc.com/policies/data-api-terms [accessed March 25, 2024]: "Reddit may set and enforce limits on your use of the Data APIs (e.g., limiting the number of API requests that you may make or the number of App Users you may serve), in our sole discretion." See also similar language in Google, Google Researcher Program Acceptable Use Policy, Google Request Records, https://requestrecords.google.com/researcher/policy [accessed March 25, 2024]: "Your use of Program Access and Program Data is subject to your agreement to abide by the technical limitations (e.g., rate limits) and security requirements for each Google service you access. These technical and security limitations may evolve over the course of the program, and will be provided to you upon successful admittance into the program, where these requirements are applicable. They will at no time prohibit the publication of findings and methodology from research undertaken in the context of this program."
13 Meta, Meta Content Library and API, Transparency Center, https://transparency.fb.com/researchtools/meta-content-library [accessed March 25, 2024].
14 See mention of data agreements: SOMAR, Meta Content Library and Content Library API, https://somar.infoready4.com/#freeformCompetitionDetail/1910793 [accessed March 25, 2024].
15 Meta clarifies that "the researcher applicant, and any academic university or institution with which the applicant is affiliated, must not be in a jurisdiction that is the target of sanctions imposed by the United States, United Kingdom, European Union or United Nations." See Meta for Developers, Get access, https://developers.facebook.com/docs/content-library-api/get-access [accessed March 25, 2024].
16 TikTok for developers, Research API, https://developers.tiktok.com/products/research-api/ [accessed March 25, 2024].
17 Google, Researcher Engagement, Transparency Center, https://transparency.google/intl/en_us/researcher-engagement/ [accessed March 25, 2024].
18 Example from Snapchat, Research Data Access Instructions, Privacy and Safety Hub, https://values.snap.com/privacy/transparency/researcher-access [accessed March 25, 2024]. Other applications have similar language: LinkedIn, Contact LinkedIn support, in Help, https://www.linkedin.com/help/linkedin/ask/DSA [accessed March 25, 2024]. See also X DSA Researcher Application (Google Form) [accessed March 25, 2024]. See also AliExpress, AliExpress Open Research & Transparency, Open Research and Transparency, https://yida.alibaba-inc.com/o/research/api#/ [accessed March 25, 2024]. See also Bing Qualified Researcher Program (Microsoft Office Form) [accessed March 25, 2024].
19 See YouTube, YouTube Researcher Program Application, YouTube Help, https://support.google.com/youtube/contact/yt_researcher_certification [accessed March 25, 2024]. See also AliExpress, AliExpress Open Research & Transparency, Open Research and Transparency, https://yida.alibaba-inc.com/o/research/api [accessed March 25, 2024]. See also Pinterest, Researchers intake form, Help Center, https://help.pinterest.com/en/landing/researchers-intake-form [accessed March 25, 2024]. See also Bing Qualified Researcher Program (Microsoft Office Form) [accessed March 25, 2024]. See also LinkedIn, Contact LinkedIn support, in Help, https://www.linkedin.com/help/linkedin/ask/DSA [accessed March 25, 2024].
20 See LinkedIn, Contact LinkedIn support, in Help, https://www.linkedin.com/help/linkedin/ask/DSA [accessed March 25, 2024]. See also AliExpress, AliExpress Open Research & Transparency, Open Research and Transparency, https://yida.alibaba-inc.com/o/research/api#/ [accessed March 25, 2024]. See also Bing Qualified Researcher Program (Microsoft Office Form) [accessed March 25, 2024]. See also Snapchat, Research Data Access Instructions, Privacy and Safety Hub, https://values.snap.com/privacy/transparency/researcher-access [accessed March 25, 2024].
21 Google Request Records, Google Researcher Program Application, https://requestrecords.google.com/researcher/form [accessed March 25, 2024].
22 X DSA Researcher Application (Google Form) [accessed March 25, 2024].
23 Meta, Meta Content Library and API, Transparency Center, https://transparency.fb.com/researchtools/meta-content-library [accessed March 25, 2024].
24 See TikTok for developers, Research API, https://developers.tiktok.com/products/research-api/ [accessed March 25, 2024]: "non-profit academic institution" under "who can apply?" See also YouTube, Program Terms & Conditions, Definitions, https://research.youtube/policies/terms/ [accessed March 25, 2024].
25 See Snapchat, Research Data Access Instructions, Privacy and Safety Hub, https://values.snap.com/privacy/transparency/researcher-access [accessed March 25, 2024]. See also Reddit Researcher Access Request instructions in Appendix. See also Pinterest, Researchers intake form, Help Center, https://help.pinterest.com/en/landing/researchers-intake-form [accessed March 25, 2024]. See also LinkedIn, Contact LinkedIn support, in Help, https://www.linkedin.com/help/linkedin/ask/DSA [accessed March 25, 2024]. See also Bing Qualified Researcher Program (Microsoft Office Form) [accessed March 25, 2024].
26 TikTok for developers, Research API, https://developers.tiktok.com/products/research-api/ [accessed March 25, 2024]. See also AliExpress, AliExpress Open Research & Transparency, Open Research and Transparency, https://yida.alibaba-inc.com/o/research/api [accessed March 25, 2024].
27 SOMAR InfoReady Application Guide for Meta Content Library and Content Library API (Google Doc) [accessed March 25, 2024].
28 AliExpress, AliExpress Open Research & Transparency, Open Research and Transparency, https://yida.alibaba-inc.com/o/research/api [accessed March 25, 2024]. See also Pinterest, Researchers intake form, Help Center, https://help.pinterest.com/en/landing/researchers-intake-form [accessed March 25, 2024]. See also Snapchat, Research Data Access Instructions, Privacy and Safety Hub, https://values.snap.com/privacy/transparency/researcher-access [accessed March 25, 2024]. See also "Research Project Information", SOMAR InfoReady Application Guide for Meta Content Library and Content Library API (Google Doc) [accessed March 25, 2024]. See also LinkedIn, Contact LinkedIn support, in Help, https://www.linkedin.com/help/linkedin/ask/DSA [accessed March 25, 2024]. See also Google Request Records, Google Researcher Program Application, https://requestrecords.google.com/researcher/form [accessed March 25, 2024].
29 Example from AliExpress, AliExpress Open Research & Transparency, Open Research and Transparency, https://yida.alibaba-inc.com/o/research/api#/ [accessed March 25, 2024]. See also Snap, Researcher Data Access Instructions, Privacy and Safety Hub, https://values.snap.com/privacy/transparency/researcher-access [accessed March 25, 2024], which has a similar question: "Confirmation that your research is for non-commercial purposes."
30 Google Request Records, Google Researcher Program Application, https://requestrecords.google.com/researcher/form [accessed March 25, 2024].
31 X DSA Researcher Application (Google Form) [accessed March 25, 2024].
32 Meta, Product Terms for Meta Research Tools, updated Feb 2, 2024, https://transparency.fb.com/researchtools/product-terms-meta-research [accessed March 25, 2024].
33 Qualified Research Partners "are free from any affiliation with TikTok and its affiliates (e.g., not working for TikTok or its affiliates as an intern, freelancer, vendor, or consultant)". TikTok, TikTok Research API Terms of Service, Aug 10, 2023, https://www.tiktok.com/legal/page/global/terms-of-service-research-api/en [accessed March 25, 2024].
34 Example from AliExpress, AliExpress Open Research & Transparency, Open Research and Transparency, https://yida.alibaba-inc.com/o/research/api#/ [accessed March 25, 2024]. See also Bing Qualified Researcher Program (Microsoft Office Form) [accessed March 25, 2024]. See also SOMAR InfoReady Application Guide for Meta Content Library and Content Library API (Google Doc) [accessed March 25, 2024]. See also LinkedIn, Contact LinkedIn support, in Help, https://www.linkedin.com/help/linkedin/ask/DSA [accessed March 25, 2024]. See also Pinterest, Researchers intake form, Help Center, https://help.pinterest.com/en/landing/researchers-intake-form [accessed March 25, 2024].
35 Bing Qualified Researcher Program (Microsoft Office Form) [accessed March 25, 2024]. See also LinkedIn, Contact LinkedIn support, in Help, https://www.linkedin.com/help/linkedin/ask/DSA [accessed March 25, 2024]. See also X DSA Researcher Application (Google Form) [accessed March 25, 2024].
36 TikTok for developers, Research API Application, https://developers.tiktok.com/application/research-api [accessed March 25, 2024]. See also Google Request Records, Google Researcher Program Application, https://requestrecords.google.com/researcher/form [accessed March 25, 2024].
37 See Snapchat, Research Data Access Instructions, Privacy and Safety Hub, https://values.snap.com/privacy/transparency/researcher-access [accessed March 25, 2024]. See also Instructions for Reddit Researcher Access Request in the Appendix.
38 LinkedIn, Contact LinkedIn support, in Help, https://www.linkedin.com/help/linkedin/ask/DSA [accessed March 25, 2024].
39 Google Request Records, Google Researcher Program Application, https://requestrecords.google.com/researcher/form [accessed March 25, 2024].
40 TikTok for developers, Research API Application, https://developers.tiktok.com/application/research-api [accessed March 25, 2024].
41 See instructions in Appendix for Reddit Researcher Access Request.
42 SOMAR InfoReady Application Guide for Meta Content Library and Content Library API (Google Doc) [accessed March 25, 2024].
43 X DSA Researcher Application (Google Form) [accessed March 25, 2024].
44 AliExpress, AliExpress Open Research & Transparency, Open Research and Transparency, https://yida.alibaba-inc.com/o/research/api#/ [accessed March 25, 2024]. See also Bing Qualified Researcher Program (Microsoft Office Form) [accessed March 25, 2024]. See also LinkedIn, Contact LinkedIn support, in Help, https://www.linkedin.com/help/linkedin/ask/DSA [accessed March 25, 2024]. See also Pinterest, Researchers intake form, Help Center, https://help.pinterest.com/en/landing/researchers-intake-form [accessed March 25, 2024].
45 X DSA Researcher Application (Google Form) [accessed March 25, 2024].
46 TikTok for developers, Research API Application, https://developers.tiktok.com/application/research-api [accessed March 25, 2024].
47 See Google Request Records, Google Researcher Program Application, https://requestrecords.google.com/researcher/form [accessed March 25, 2024]: "Note: Data access and related permissions will be granted until the research end date or a maximum of 9 months at a time – and can be renewed if required."
48 Snap, Researcher Data Access Instructions, Privacy and Safety Hub, https://values.snap.com/privacy/transparency/researcher-access [accessed March 25, 2024].
49 Example from Pinterest, Researchers intake form, Help Center, https://help.pinterest.com/en/landing/researchers-intake-form [accessed March 25, 2024]. See also similar language in AliExpress, AliExpress Open Research & Transparency, Open Research and Transparency, https://yida.alibaba-inc.com/o/research/api#/ [accessed March 25, 2024]. See also Bing Qualified Researcher Program (Microsoft Office Form) [accessed March 25, 2024]. See also LinkedIn, Contact LinkedIn support, in Help, https://www.linkedin.com/help/linkedin/ask/DSA [accessed March 25, 2024]. See also X DSA Researcher Application (Google Form) [accessed March 25, 2024]. See also SOMAR InfoReady Application Guide for Meta Content Library and Content Library API (Google Doc) [accessed March 25, 2024].
50 Google Request Records, Google Researcher Program Application, https://requestrecords.google.com/researcher/form [accessed March 25, 2024].
51 LinkedIn, Contact LinkedIn support, in Help, https://www.linkedin.com/help/linkedin/ask/DSA [accessed March 25, 2024]. See also Bing Qualified Researcher Program (Microsoft Office Form) [accessed March 25, 2024].
52 See reference to Association of Internet Researchers in SOMAR InfoReady Application Guide for Meta Content Library and Content Library API (Google Doc) [accessed March 25, 2024]. See also franzke, aline shakti, Bechmann, Anja, Zimmer, Michael, Ess, Charles and the Association of Internet Researchers (2020), Internet Research: Ethical Guidelines 3.0, https://aoir.org/reports/ethics3.pdf
53 See Google Request Records, Google Researcher Program Application, https://requestrecords.google.com/researcher/form [accessed March 25, 2024]; when the applicant selects "yes" to GDPR obligations the following appears: "Please provide evidence for this such as a GDPR certification, an evaluation report from an independent assessor, or any other appropriate evidence."
54 X DSA Researcher Application (Google Form) [accessed March 25, 2024].
55 TikTok, Compliance Certification, https://www.tiktok.com/legal/page/global/compliance-certification/en [accessed March 25, 2024].
56 Microsoft, Bing Qualified Researcher Program, Support, https://support.microsoft.com/en-us/topic/bing-qualified-researcher-program-b1c5a4b6-3ad6-4c8c-963d-a395c7766c85 [accessed March 26, 2024]: "Selected researchers may be asked to enter into terms governing their access to the public data (and/or use of APIs (as applicable)) for the purpose of the approved research, including terms related to the security and legal use of personal data that may be made available to you under General Data Protection Regulation." See also Pinterest, Researchers intake form, Help Center, https://help.pinterest.com/en/landing/researchers-intake-form [accessed March 25, 2024]: "By submitting this form, you understand and agree that your application, if successful, will be subject to additional Terms and Conditions to be shared by Pinterest with you."
57 Snap, Researcher Data Access Instructions, Privacy and Safety Hub, https://values.snap.com/privacy/transparency/researcher-access [accessed March 25, 2024].
58 TikTok, TikTok Research API Terms of Service, https://www.tiktok.com/legal/page/global/terms-of-service-research-api/en [accessed March 25, 2024].
59 X Developer Platform, Developer Agreement (updated November 14, 2023), https://developer.twitter.com/en/developer-terms/agreement [accessed March 25, 2024].
60 See YouTube, Program Terms & Conditions (July 11, 2022), https://research.youtube.com/policies/terms/ [accessed March 25, 2024]: "You agree to regularly refresh Program Data as specified by the Developer API ToS (i.e. every 30 days) until such time as you need the Program Data to be fixed as to a point in time for the purposes of finalizing your analysis and drawing of conclusions with respect to your Researcher Publications."
61 Meta for Developers, Content Library and API, Data deletion, https://developers.facebook.com/docs/content-library-api/data-deletion [accessed March 25, 2024].
62 YouTube, Program Terms & Conditions (July 11, 2022), https://research.youtube.com/policies/terms/ [accessed March 25, 2024].
63 TikTok, TikTok Research API Terms of Service, https://www.tiktok.com/legal/page/global/terms-of-service-research-api/en [accessed March 25, 2024].
64 Meta, Product Terms for Meta Research Tools, Transparency Center (February 2, 2024), https://transparency.fb.com/researchtools/product-terms-meta-research [accessed March 25, 2024].
65 LinkedIn, Additional Terms for the LinkedIn Research Tools Program, August 21, 2023, https://www.linkedin.com/legal/l/research-api-terms [accessed March 25, 2024].
66 See YouTube, Program Terms & Conditions, Definitions, https://research.youtube/policies/terms/ [accessed March 25, 2024]. See also TikTok, TikTok Research API Terms of Service, https://www.tiktok.com/legal/page/global/terms-of-service-research-api/en [accessed March 25, 2024]. See also "efforts to ensure" in Meta, Product Terms for Meta Research Tools, Transparency Center (February 2, 2024), https://transparency.fb.com/researchtools/product-terms-meta-research [accessed March 25, 2024].
67 TikTok, Global Data Sharing Research Appendix, https://www.tiktok.com/legal/page/global/global-data-sharing-research-appendix/en [accessed March 25, 2024].
68 Meta, Product Terms for Meta Research Tools, Transparency Center (February 2, 2024), https://transparency.fb.com/researchtools/product-terms-meta-research [accessed March 25, 2024].
69 See X Developer Platform, Developer Agreement (updated November 14, 2023), https://developer.twitter.com/en/developer-terms/agreement [accessed March 25, 2024]; full clause: "User Protection. Unless explicitly approved by X in writing, you shall not use, or knowingly display, distribute, or otherwise make X Content, or information derived from X Content, available for purpose of: (a) conducting or providing surveillance or gathering intelligence, including but not limited to investigating or tracking X users or X Content; (b) conducting or providing analysis or research for any unlawful or discriminatory purpose, or in a manner that would be inconsistent with X users' reasonable expectations of privacy; (c) monitoring sensitive events (including but not limited to protests, rallies, or community organizing meetings); or (d) targeting, segmenting, or profiling individuals based on sensitive personal information, including their health (e.g., pregnancy), negative financial status or condition, political affiliation or beliefs, racial or ethnic origin, religious or philosophical affiliation or beliefs, sex life or sexual orientation, trade union membership, X Content relating to any alleged or actual commission of a crime, or any other sensitive categories of personal information prohibited by law."
70 Google, Google Researcher Program Acceptable Use Policy, Google Request Records, https://requestrecords.google.com/researcher/policy [accessed March 25, 2024].
71 See TikTok for developers, TikTok Commercial Content API Application, https://developers.tiktok.com/application/commercial-content-api [accessed March 26, 2024], "how to apply for access."
See also Google Cloud, Google Ads Transparency Center, https://console.cloud.google.com/marketplace/details/bigquery-public-data/google-ads-transparency-center [accessed March 26, 2024]: "to use BigQuery, user must have a Google account and create a GCP [Google Cloud Project] project." See also LinkedIn Ad Library, LinkedIn Ad Library API Program, https://www.linkedin.com/ad-library/api [accessed March 26, 2024]: "create a developer application in the Developer Portal." See also Meta, Meta Ad Library API, https://www.facebook.com/ads/library/api/?source=onboarding [accessed March 26, 2024]: "Create a Meta for Developers account."
72 LinkedIn Ad Library, LinkedIn Ad Library API Program, https://www.linkedin.com/ad-library/api [accessed March 26, 2024]: "Ad Library API is a vetted product, and can be gained access by requesting the product from our developer portal."
73 TikTok for developers, TikTok Commercial Content API Application, https://developers.tiktok.com/application/commercial-content-api [accessed March 26, 2024].
74 Apple, Ad Repository API, Version 1.0 (March 2024), https://developer.apple.com/support/downloads/Ad-Repository.pdf
75 Apple, Ad Repository API, Version 1.0 (March 2024), https://developer.apple.com/support/downloads/Ad-Repository.pdf (number of impressions is not included in the code book or in the user interface).
76 See Google Ads Transparency Center, "Have questions? We've got answers.", https://adstransparency.google.com/faq?region=US [accessed March 25, 2024]: "We include ads served from verified advertisers across Search, Display, YouTube and Gmail."
77 See Apple, Ad Repository, https://adrepository.apple.com/ [accessed March 26, 2024]: "in select EU countries and regions." See also X Business, Ads transparency, https://business.x.com/en/help/ads-policies/product-policies/ads-transparency.html [accessed March 26, 2024]: "for ads served in the EU." See also Microsoft, Ad Library, https://adlibrary.ads.microsoft.com/ [accessed March 26, 2024]: "ads served on Bing in the European Union (EU) and European Economic Area (EEA)." See also Pinterest, Ads Repository, https://ads.pinterest.com/ads-repository/ [accessed March 26, 2024] (the countries in the picklist are European). See also Booking.com, Booking Ads Frequently asked questions, https://www.booking.com/ad-repository/faq.html [accessed March 26, 2024]: "ads shown on Booking.com in the European Economic Area (EEA) member states, including the European Union." See also TikTok, Find ads on TikTok, https://library.tiktok.com/ads [accessed March 26, 2024].
78 See Google Ads Transparency Center, "Have questions? We've got answers.", https://adstransparency.google.com/faq?region=US [accessed March 25, 2024]: "Due to regulations, ads shown in Europe, Türkiye, and political ads have extra info, like the topic of the ad and more audience selection details."
79 Meta, Ads about Social Issues, Elections or Politics, Transparency Center, https://transparency.fb.com/policies/ad-standards/siep-advertising/siep [accessed March 26, 2024].
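The rate-limit clauses quoted in endnote 12 above (Reddit and Google both reserve the right to set and change limits at their discretion) mean that collection scripts built on these APIs have to tolerate throttling. A generic sketch, not tied to any particular platform's API: retry with exponentially growing, capped delays when a request signals throttling (e.g., an HTTP 429 response).

```python
import time

def backoff_delays(max_retries=5, base=1.0, cap=60.0):
    """Yield exponentially growing wait times in seconds, capped at `cap`."""
    for attempt in range(max_retries):
        yield min(cap, base * (2 ** attempt))

def fetch_with_backoff(do_request, max_retries=5, sleep=time.sleep):
    """Call `do_request()` until it succeeds or retries are exhausted.

    `do_request` stands in for a single API call; it should return the
    response on success and raise an exception when the platform throttles.
    `sleep` is injectable so the retry logic can be tested without waiting.
    """
    last_exc = None
    for delay in backoff_delays(max_retries):
        try:
            return do_request()
        except Exception as exc:  # in real code, catch the specific 429 error
            last_exc = exc
            sleep(delay)
    raise last_exc
```

In practice researchers would also honour any Retry-After header or documented per-endpoint quota, but the capped exponential schedule above is the common baseline.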
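Endnote 71 above notes that the Google Ads Transparency Center data is also exposed as a BigQuery public dataset, which requires a Google account and a Cloud project. A minimal sketch using the official google-cloud-bigquery client; the dataset path follows the marketplace listing cited above, but the table name (`creative_stats`) and the column names used here are assumptions that should be verified against the dataset's published schema before use.

```python
# Sketch of querying the Google Ads Transparency Center BigQuery public
# dataset (see endnote 71). Requires `pip install google-cloud-bigquery`
# and Google Cloud credentials with a project configured.
# NOTE: the table and column names below are assumptions -- check them
# against the dataset schema in the BigQuery console before running.
TOP_ADVERTISERS_SQL = """
SELECT advertiser_id, COUNT(*) AS n_creatives
FROM `bigquery-public-data.google_ads_transparency_center.creative_stats`
GROUP BY advertiser_id
ORDER BY n_creatives DESC
LIMIT 10
"""

def top_advertisers():
    """Return (advertiser_id, creative count) pairs for the busiest advertisers."""
    from google.cloud import bigquery  # imported lazily; needs the package installed
    client = bigquery.Client()  # picks up default credentials and project
    return [(row.advertiser_id, row.n_creatives)
            for row in client.query(TOP_ADVERTISERS_SQL).result()]
```

Queries against public datasets are billed to the researcher's own Cloud project, which is why the marketplace listing insists on a GCP project even though the data itself is public.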
European Digital Media Observatory

Report on EDMO Workshop on Platform Data Access for Researchers

Lisa Ginsborg

Contributors: Kalina Bontcheva, Valentin Châtelet, Philipp Darius, Matt Motyl, Andreas Neumeier

September 2024

Contents
About the Workshop
Background: Data access provisions in EU regulation and self-regulation on disinformation
Workshop programme
Key Takeaways from the EDMO Workshop
Findings from surveying the EDMO Network on researcher data access
Access to data for researchers under DSA Art. 40(4), 40(12) and EDMO's work on data access
Online Platform Data Collection and Access
Experiences, challenges and opportunities with platform APIs
TikTok Research API
YouTube API, Google Search and other platform APIs
Meta Content Library and API

The European Digital Media Observatory has received funding from the European Union under contract number LC-01935415

About the Workshop

On 15 May 2024 EDMO organised a workshop on the topic of Platform Data Access for Researchers at the Brussels School of Governance. The workshop (led by the EUI STG in collaboration with Globsec) took place in hybrid format and was attended by over 50 participants in person and 25 online, including representatives from the EDMO Network and civil society organisations, as well as EC representatives and some VLOP representatives. The workshop aimed to bring together the research community to discuss and share their experience and views on platform data access for research, including newly developed platform APIs, the type of data available and its accessibility, and whether the new research data access provisions under the DSA meet the needs of the research community.
Speakers from the research community (including representatives from the EDMO.eu and Vera.ai projects, the Integrity Institute, the Digital Forensic Research Lab, and the Center for Digital Governance, Hertie School) provided key insights and shared concrete experiences with new and old data access tools provided by VLOPs and VLOSEs, as well as potential opportunities and challenges for research. Given the key role of independent research in enabling transparency, independent oversight and a deeper understanding of the disinformation phenomenon, especially in the context of new technological developments and upcoming elections worldwide, wider data access for research organisations conducting independent research in the public interest remains a key priority for the EDMO community.

Background: Data access provisions in EU regulation and self-regulation on disinformation

In the 2022 Strengthened Code of Practice on Online Disinformation many online platforms explicitly committed to provide access, wherever safe and practicable, to "continuous, real-time or near real-time, searchable stable access to non-personal data and anonymised, aggregated, or manifestly-made public data for research purposes on Disinformation through automated means such as APIs or other open and accessible technical solutions allowing the analysis of said data." This voluntary commitment has now become a legal obligation for VLOPs/VLOSEs under DSA Article 40(12), which explicitly requires platforms to provide access to publicly accessible data and, when technically possible, to real-time data. The Delegated Act on data access, which was scheduled to be adopted in spring 2024, has not yet been released. Data access provisions for researchers present a rapidly changing area, with a number of new tools being developed by platforms, especially researcher-specific APIs intended to give researchers access to real-time data.
Yet a number of limitations appear to be currently hindering their use by the research community, including lack of awareness of the new tools, limited access for civil society researchers, complicated or lengthy application procedures, and potential legal risks deriving from their use. As a result, the uptake of APIs appears slow and piecemeal, with very few European researchers actively using such APIs at present. Although greater uptake may be expected as the tools continue to be rolled out and refined, as most tools are still being tested it is essential to better understand their utility and how they compare to previous instruments.

Workshop programme

Moderator: Claes de Vreese | EDMO Executive Board, University of Amsterdam

Framing the issues surrounding data access for research
Speakers:
• Lisa Ginsborg | EDMO, School of Transnational Governance, EUI
• Rebekah Tromble | George Washington University, EDMO
• Matt Motyl | Integrity Institute

Exchange of experience using current platform APIs
Speakers:
• Kalina Bontcheva | University of Sheffield, BROD, vera.ai
• Valentin Châtelet | DFRLab, Atlantic Council
• Philipp Darius | Center for Digital Governance, Hertie School
• Andreas Neumeier | Bundeswehr University

Key Takeaways from the EDMO Workshop

Despite the entry into force of the DSA and its key provisions on data access for research purposes in Art. 40, in practice data access for researchers has not yet seen a significant overall improvement. In fact, in certain areas or in relation to specific tools, data access for researchers appears to present significant limitations and in certain cases may even have deteriorated over the last year. While this may be temporary as the new instruments become operational, and before the delegated act on data access is adopted, a number of positive developments may also be noted in this area.
These include public statements by the EC that the interpretation of Art. 40(12) appears to enable non-permissioned scraping,1 and the fact that researcher data access provisions under Art. 40(12) now exist for the large majority of VLOPs, at least on paper, although in many cases the details of the concrete programs and data available are still missing. In practice, current data access for researchers remains limited to date, with some platform APIs performing better than others, while a number of significant shortcomings remain and are presented in the current report. Among such limitations, researchers report limited accessibility, complex application procedures which include significant risks with regard to liability and fines, and the requirement for applications to be linked to specific projects rather than approved at organisational level. The legal requirements imposed on researchers in the platforms' contracts should therefore be clear and not prohibitive for smaller research organisations, in light of the risk of large fines and the lack of existing mechanisms to protect research organisations from such risks. Equality in terms of data access is also key: online platforms and search engines should be discouraged from providing data and/or funding to only a small group of researchers (on specific issues or from selected countries), selected by the companies themselves in a non-transparent manner.

1 See for instance Press Release of 12 July 2024, 'Commission sends preliminary findings to X for breach of the Digital Services Act', available at https://ec.europa.eu/commission/presscorner/detail/en/ip_24_3761
Leaving aside the specific problems of each platform API, which are presented in detail in the current report, greater standardisation in data access API provisions by similar kinds of platforms and search engines is fundamental for researchers to be able to conduct cross-platform research, which is central to studying disinformation in a holistic way. In particular, it is crucial for similar types of VLOPs and VLOSEs to provide similar types of data via their APIs in order to allow for comparability across platforms (e.g., TikTok and Instagram; Bing Search and Google Search). While researchers made it clear that platforms providing data through their APIs is a very welcome development, few researchers appear to be gaining access to some of the platform APIs at present, and data quality and reliability are also a concern among researchers using these APIs. This underlines the need for an independent entity that could test and ensure data quality under DSA Article 40(12), and potentially under Articles 40(4) and 40(8), as it is clear that this cannot be done by individual researchers. With regard to the way data is accessed through specific APIs, streaming/real-time APIs (e.g., the previous Twitter API), which continuously deliver new data, may be preferable to APIs that require periodic downloading (YouTube and TikTok), as the latter may also limit the kind of research that is possible. Finally, the clean-room approach of the Meta Content Library and API is highly restrictive with respect to repeatability and open science, particularly since the clean room is wiped after a certain period of time; the requirement of access through a VPN is also problematic for data transparency.
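The contrast drawn above between streaming APIs and periodic-download APIs has a practical consequence: with the latter, researchers poll on a schedule and must deduplicate items that recur across overlapping result windows. A schematic sketch of that pattern, in which `fetch_page` is a hypothetical stand-in for whatever platform endpoint is used, not a real client:

```python
def poll_once(fetch_page, seen_ids, cursor=None):
    """Fetch one page of results and return only items not seen before.

    `fetch_page(cursor)` is a placeholder for a platform API call; it is
    assumed to return (items, next_cursor), where each item carries an "id".
    `seen_ids` is a set that persists across polling rounds and is updated
    in place, so repeated items from overlapping windows are dropped.
    """
    items, next_cursor = fetch_page(cursor)
    fresh = [item for item in items if item["id"] not in seen_ids]
    seen_ids.update(item["id"] for item in fresh)
    return fresh, next_cursor
```

A real collection script would wrap this in a scheduler and persist `seen_ids` (and any refresh timestamps, e.g. for terms that require re-fetching data every 30 days) to disk between runs.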
Overall, it is essential that, while new tools are rolled out and tested, data access for researchers improves urgently and continues to do so over time for all VLOPs and VLOSEs, while ensuring the information provided to the research community is clear and user friendly. Imposing fees for researcher data access does not allow for such incremental access. With regard to DSA Art. 40(4), the system will only become fully operational once the Delegated Act is released. The work done (and presented during the workshop) by the EDMO Working Group for the Creation of an Independent Intermediary Body to Support Research on Digital Platforms (led by Dr. Rebekah Tromble), as well as the draft Code of Conduct on how platforms can share data with independent researchers while protecting users' rights, will play a key role going forward. Collaboration will be key for researchers to ensure that research priorities are prominent and that the processes and institutions, including the relevant Digital Services Coordinators (DSCs), are not overwhelmed. In this respect, the importance of researchers being aware of the variety of data collected by platforms, which may be requested for research purposes, becomes even more pressing. Greater information sharing and relevant tools are essential here, including the data dictionary currently being developed by the Integrity Institute, covering keywords, phrases in metrics, and variables researchers might request for research purposes.

Findings from surveying the EDMO Network on researcher data access

Lisa Ginsborg presented the results from a recent survey conducted by EDMO.eu, inter alia, with research representatives in the EDMO Hubs concerning VLOPs' and VLOSEs' implementation of their commitments on the empowerment of researchers under the Strengthened Code of Practice (CoP).
The report from the survey describes a number of limitations faced by the research community with regard to access to platform data, a pre-condition for transparency and accountability that is long overdue. In particular, the survey conducted by EDMO shows that data transparency and data access through APIs (Commitment 26) continue to be a key priority for the research community. Despite encouraging reports by all platforms on recent launches of new tools for researchers, the uptake of APIs appears slow and piecemeal, with very few European researchers among the EDMO Hubs seemingly using such APIs at present. Several reasons may explain the lack of use of the new APIs by EDMO Hubs, starting most obviously with how recently they were set up. Other reasons that emerged from asking researchers in EDMO Hubs include lack of awareness of the new tools, complicated or lengthy application procedures, and potential legal risks deriving from their use. While greater uptake may be expected as the tools continue to be rolled out and refined, it is clear that simply launching an API for researchers may not be enough to meet the requirements of Commitment 26 and, ultimately, the DSA. Given that most tools are still being tested, it remains difficult to assess their utility and how they compare to previous instruments. The training sessions, workshops, and collaboration environment for fact-checkers and researchers offered by CrowdTangle in the past (which allowed researchers to exchange experience, tools and methodology and provided a clear point of contact for researchers) constitute good practices in this regard. Researchers reported them to be especially useful around election periods.
A number of problems were also reported by EDMO Hubs, including: the length of platforms' application processes, which makes it difficult for researchers to complete the intended research projects within the funding period; the challenge of reconciling contractual requirements with the conditions of independent research and the university's protection of its employees; as well as complicated authorisation procedures and concerns about the current conditions imposed in platform contracts, including the risk of large fines for research organizations and the lack of existing mechanisms to protect them from such risks. A number of concrete recommendations also emerged from the EDMO Hub representatives surveyed, including:
• The need for platform tools and interfaces for data access to be rolled out quickly;
• The need to raise awareness of the newly developed APIs, including making the information clear and user friendly and providing training and support to the research community on these tools;
• The need for application procedures to be clear and for researchers' applications to be approved swiftly;
• The suggestion that applications be approved at organizational level, rather than being linked to specific projects;
• The data provided should meet the needs of the research community by enabling incremental and near real-time access to data, including providing researchers with information on content flagged as misinformative;
• Allowing access to the APIs also for researchers from civil society;
• Providing support to researchers entering legal contracts, e.g., the establishment of a legal protection/insurance system for public research institutions, NGOs and independent researchers;
• Greater standardization of APIs, which would allow the research community to use them more widely rather than having to gain specific skills for each platform;
• Going forward, the question may be raised of whether specific tools could be developed to enable the use of APIs by researchers from disciplinary backgrounds beyond data science.

Access to data for researchers under DSA Art. 40(4), 40(12) and EDMO's work on data access

Rebekah Tromble provided a number of updates and insights on the state of data access under the DSA, in particular with regard to Articles 40(12) and 40(4), what may be expected going forward, and EDMO's work on data access in preparation for Art. 40(4) becoming fully operational. While some elements are still missing, including the relevant delegated acts and guidance from DSCs, there are a number of actions researchers can already start working on now. Dr. Tromble started by emphasising some of the positive developments in light of DSA Art. 40(12), including the public statement from a number of key members of the EC that their interpretation of Art. 40(12) is that it enables non-permissioned scraping. Programs for data access under Art. 40(12) exist for the large majority of VLOPs and VLOSEs, at least on paper, although in many cases the details of the concrete programs and data are still missing. Both platforms and researchers need more guidance from the EC about the requirements for such programs, but this is unlikely to come before the Delegated Act for Art. 40(4) is released, given the need for harmonisation between the different provisions. EC interest in this area is demonstrated by its inquiries to all of the VLOPs and VLOSEs for more information about their Art. 40(12) programs, as well as by the formal investigations into X and Meta, including for their compliance with Art. 40(12).
Other organizations are also stepping in, including EDMO; the Institute for Data, Democracy & Politics, with its Tracker aiming to describe existing research access tools; and the Coalition for Independent Technology Research, with its DSA Data Access Audit and Survey, which aims to gather information about researchers' experience with the data access programs and, in the future, to work with researchers to put together applications in order to test the systems directly.

With regard to Art. 40(4), researchers who are affiliated with research organizations as defined by the EU Copyright Directive will have the ability to apply for specific non-public data. While the exact details still need to be clarified, it is clear that both civil society and academic organizations with a substantial research mission will qualify under Article 40(4). It remains to be clarified whether non-European research institutions will qualify, raising the importance of collaboration going forward. The data researchers can apply for must be 'in scope', relating to European systemic risks and potential mitigations for those risks. The request must be proportionate, the data must 'exist', must not jeopardise trade secrets, and must comply with the GDPR. The EDMO Draft Code of Conduct aims to help researchers and regulators in this respect, provides a partial blueprint for the delegated act, and aims to offer further guidance also to the DSCs. The application process will be further clarified in the Delegated Act, but is likely to see researchers submitting their applications to local DSCs; in most instances the DSC would send the application to an independent intermediary body for an advisory opinion, a body which the EDMO Working Group for the Creation of an Independent Intermediary Body to Support Research on Digital Platforms has been working on.
The local DSC would then send the recommendation to the DSC of establishment (in most cases the Irish DSC), which would make the final decision. If positive, researchers would be designated as "vetted researchers" and could make a "reasoned request" for data from the platform(s). Platforms then have 15 days to respond, and may object either because they do not have the data or on trade-secret grounds. A negotiation process may be expected in this context involving the regulators, the platforms and the researchers themselves, with the likely involvement of the intermediary body. Potential hurdles and unknowns remain, including how to tailor the GDPR risk assessment to the data received, assessing who and what qualifies for research projects, and what the platform fees, if any, might be; but the biggest potential hurdle is overwhelming the regulators with data requests. It is therefore important for researchers to provide feedback to the regulators but, most importantly, to cooperate with one another so as not to overwhelm the system. Once the intermediary body is established, it aims to bring researchers together for consultations in Fall 2024 to identify the most pressing data needs and priorities to be shared with DSCs.

Online Platform Data Collection and Access

Matt Motyl from the Integrity Institute aimed to help researchers outside industry learn about what data is currently collected by the relevant companies. This includes the obvious data provided by users, the less obvious data extracted by platforms, and additional data learned about users, often through machine learning, AI, or by buying data from third parties. While the presentation started from Facebook, the lessons were seen to be applicable to most social media platforms. Dr.
Motyl presented some of the categories of data users normally provide to Facebook, either directly or through their consumption patterns or posts, and the additional data platforms may extract from the data users provide. Further user behaviours on the platforms (e.g., time spent on the platform, clicks including on ads, purchases, reactions, comments, reporting behaviour, etc.) provide signals to platforms on which content to feed to users. Time spent engaging in such behaviours is also logged. Further data is provided by how users interact with each other in relation to all of the behaviours above. Less obvious data that platforms extract about users includes device metadata, from which they can infer where people are located, what network they are on, whether they use multiple devices, whether multiple people use the same device to access those platforms, GPS location, and all kinds of network information, including phone numbers and names and whether those contacts are also on the platform; from there, platforms can establish networks, which can help establish, by way of example, what people in specific networks may buy, or even instances of coordinated inauthentic behaviour. More sophisticated information that may be collected includes the probability that users are real and are who they say they are, the probability that users will buy specific things, who users are likely to vote for, what parties they may belong to or how likely they are to show up to a political event, whether the user is working in coordination to interfere with another country's election, and whether the user produces unwanted social interactions. Most of the above information will exist at the user level but also at the level of content, group, ad, page, etc. Dr. Motyl also introduced the way data is stored by the platforms, in particular the two main categories of data storage: dimension tables and fact tables.
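The two storage categories Dr. Motyl introduces can be illustrated with a minimal sketch. All table, column and id names below are invented for illustration; they are not actual platform schemas:

```python
import json

# Hypothetical dimension table: one row per aggregated object (here, a user),
# with a key id and one column per variable.
user_dim = {
    "u_001": {"country": "DE", "account_age_days": 412, "follower_count": 180},
    "u_002": {"country": "FR", "account_age_days": 37, "follower_count": 12},
}

# Hypothetical fact table: one row per event, with the payload stored as a
# JSON string of key-value pairs rather than fixed columns.
fact_rows = [
    {"user_id": "u_001", "event": "reaction",
     "payload": json.dumps({"post_id": "p_9", "type": "like"})},
    {"user_id": "u_002", "event": "comment",
     "payload": json.dumps({"post_id": "p_9", "chars": 54})},
]

def enrich(event_row):
    """Join one fact-table event with its user's dimension-table row."""
    payload = json.loads(event_row["payload"])
    return {**payload, "event": event_row["event"], **user_dim[event_row["user_id"]]}

enriched = [enrich(r) for r in fact_rows]
```

Even in this toy form, the join step hints at why combining data across many such tables, warehouses and retention regimes is costly at platform scale.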
Dimension tables are mostly structured, i.e., containing one row per aggregated object (e.g., user, post, page, video) with a key id and one column per variable (and there may often be thousands of variables). Fact tables, on the other hand, are mostly unstructured, often with one row per event; the values are not stored in columns but often in JSON strings, maps, or arrays (theoretically of length zero to unbounded) with key-value pairs. For large platforms it is probably impossible to have a single table containing all variables, as it would likely involve many exabytes of data, if not more. Different categories of variables can be stored in different tables and even in different warehouses or cloud services. Joining data is therefore quite complicated. Further, while thousands of tables may exist per user, page, group, etc., these tables may also have different retention times, until the data is transferred to a different format and becomes much more difficult to obtain. Different platforms will also have different data retention policies, and even within a company, different security/privacy settings may apply to different tables and even columns. In light of the above, accessing specific data may become quite complicated or take a long time. Matt Motyl and the Integrity Institute are currently building a data dictionary, including keywords, phrases in metrics, variables researchers might want, how to map user activities to these data tables, and community requests, with the aim that researchers will know what to ask for when they request platform data, in order not to overburden regulators going forward.

Experiences, challenges and opportunities with platform APIs

TikTok Research API
Philipp Darius from the Center for Digital Governance at the Hertie School and Andreas Neumeier from the Bundeswehr University in Munich presented their experiences, the current requirements, and problems in using the TikTok Research API. In a collaboration with the SPARTA project, the team collected political party communication on Facebook, Instagram, TikTok, Twitter/X and YouTube during the EU elections. It took TikTok approximately four weeks to grant their application for Research API access. The TikTok Research API was made accessible to European researchers on July 20, 2023. TikTok's websites contained contradictory information on whether the API is available only to European researchers or to researchers globally who focus on European systemic risks. The stated aim is to 'support research and increase transparency', and a collaboration team of up to 10 researchers in one lab may be formed on the developer platform, which should enable pooling of API keys to increase the quota of API requests. TikTok has also initiated a commercial content library that includes ads, advertiser metadata and targeting information. In order to get access to the Research API, an application form needs to be submitted, and researchers must adhere to the Community Guidelines and the TikTok Research API Services Terms of Service. The research proposal needs to be approved by the research institution's ethics committee, there should be no conflict of interest and no commercial purpose, and the researcher should have demonstrated experience and expertise and be employed by a not-for-profit organization in the EU or in the US. With regard to the data that can be collected with the TikTok Research API, researchers can generally collect content from public accounts by "creators" or users with public profiles who are aged 18 and over.
Information may in theory be collected on videos, comments, users, liked/reposted/pinned videos, and follower and following lists (if the following list is made public). Dr. Darius went on to describe their experience of using the TikTok API in practice while working on a cross-platform study collecting communication on TikTok, YouTube, Instagram, Facebook, and X by all European political parties participating in the 2024 EU elections. For the TikTok side of the collection in particular, they have been facing serious issues with the data retrieved via the TikTok Research API. The data provided between March and June 2024 cannot be used for research on party campaign communication behaviour, as it deviates strongly from the metrics shown in the app or on the webpage (the TikTok user interfaces). Mr. Neumeier proceeded to provide concrete examples of the problems experienced: comparing the JSON retrieved from the API with the relevant TikTok webpages, a number of disparities could be noticed, in particular with regard to the likes count, the comments count (which in the API appears to include the favourites count), the share count, and the view count, where very significant differences can be noted. Overall, the data provided by the API on some of the metrics was not reliable and cannot be used for their current research project. They faced the following issues:
• The API drastically underreported view and share counts of videos in contrast to the metrics visible in TikTok's user interfaces on the website and in the TikTok app.
• Conflation of the comment count and favourites count metrics.
• The follower collection was interrupted at around 3,000 accounts; there appears to be some kind of limit on it, but the reason for this remains unclear.
• API data is not available in real time, as required by DSA Art. 40(12), but with a delay of approximately 10 days.
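The kind of consistency check the team performed can be sketched in a few lines: compare the metrics returned by the API with those visible in the user interface and flag large relative deviations. All figures and field names below are invented for illustration:

```python
# Hypothetical video metrics as returned by a research API and as read off
# the platform's public user interface (all numbers invented).
api_metrics = {"like_count": 1520, "comment_count": 410, "share_count": 35, "view_count": 12000}
ui_metrics  = {"like_count": 1530, "comment_count": 97,  "share_count": 210, "view_count": 88000}

def flag_discrepancies(api, ui, tolerance=0.10):
    """Return metrics whose API value deviates from the UI value by more than
    `tolerance`, relative to the UI figure."""
    flagged = {}
    for name, ui_value in ui.items():
        deviation = abs(api[name] - ui_value) / ui_value
        if deviation > tolerance:
            flagged[name] = round(deviation, 2)
    return flagged

suspect = flag_discrepancies(api_metrics, ui_metrics)
```

In this invented example the like count passes the check while the comment, share and view counts are flagged, mirroring the pattern of disparities the researchers describe.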
The researchers reached out to the TikTok API support team and, after some time, have now been offered the opportunity to schedule an appointment to discuss the problems with the data and potential underlying issues with the API. The team thereafter also reported the issues to the Bundesnetzagentur, as the German Digital Services Coordinator, and to staff members of DG Connect. While biases of other research APIs are known (e.g., Twitter's Streaming API; see Morstatter et al., 2013), in the case of the TikTok API the data was not merely biased but faulty. In July 2024 the team found that the TikTok Research API seems to have been repaired and now provides only marginally deviating metrics from the user interface (which is acceptable for distributed systems). However, the identified problems underline the need for an independent entity that could test and ensure data quality under DSA Article 40(12), and potentially under Articles 40(4) and 40(8), since individual researchers cannot test all the data provided by platforms via APIs or via data access requests. The data tracker by the Weizenbaum Institute was also mentioned as a good practice that could be combined with other data access trackers to bundle researchers' experiences of working with platform data.

YouTube API, Google Search and other platform APIs

Kalina Bontcheva from the University of Sheffield provided an account of, and reflections on, her experience with data access at the University of Sheffield and as part of the Vera.ai EU-funded project. In particular she presented her experience with the YouTube API as well as a number of other APIs. On the positive side, the YouTube API gives access to video descriptions, subtitle data (if present), channel descriptions, etc. Based on the video ID, researchers can retrieve all comments for that video, which is important when analysing responses to a specific video that is spreading disinformation. The video comments also tend to have accurate timestamps.
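In the YouTube Data API (v3), comments for a video come back through the `commentThreads().list` endpoint as items whose text and timestamp sit under `snippet.topLevelComment.snippet`. The sketch below parses a response of that shape (the items themselves are invented sample data) and filters by timestamp on the client side:

```python
from datetime import datetime

# Sample response shaped like a YouTube Data API v3 commentThreads().list
# result with part="snippet"; the items are invented for illustration.
response = {
    "items": [
        {"snippet": {"topLevelComment": {"snippet": {
            "textDisplay": "First reaction", "publishedAt": "2024-03-01T10:00:00Z"}}}},
        {"snippet": {"topLevelComment": {"snippet": {
            "textDisplay": "Later reply wave", "publishedAt": "2024-03-05T18:30:00Z"}}}},
    ]
}

def comments_since(response, cutoff_iso):
    """Keep only top-level comments published after `cutoff_iso`.
    This filtering must happen client-side, since the endpoint offers no
    'only comments since this date' parameter."""
    cutoff = datetime.fromisoformat(cutoff_iso.replace("Z", "+00:00"))
    out = []
    for item in response["items"]:
        s = item["snippet"]["topLevelComment"]["snippet"]
        published = datetime.fromisoformat(s["publishedAt"].replace("Z", "+00:00"))
        if published > cutoff:
            out.append(s["textDisplay"])
    return out

new_comments = comments_since(response, "2024-03-02T00:00:00Z")
```

Note that the full response still has to be fetched and paid for in quota before this filter can discard the older comments, which is precisely the inefficiency discussed next.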
However, a number of issues were also encountered with the YouTube API, which could be improved over time as the dialogue on data access with platforms progresses. When retrieving comments, all comments are always returned, and there is no way of retrieving only new comments posted since a certain date, which would be more efficient with respect to the researcher's API data quota. Similarly, to check for new replies to existing top-level comments, researchers must query each top-level comment individually to get all of its children and then work out which ones are new. Researchers would benefit from an API call that returned all top-level and reply comments on a given video dated after a given timestamp. Further, the "search" API (to find videos matching search terms, or all videos posted by a given channel) is extremely expensive in terms of quota. The daily quota is thereby exhausted after just a few searches, which puts limits on the speed at which research can be conducted. While there is an explicit provision stating that "On request, and with sufficient justification, you will receive sufficient API quota for use as specified by this Program ToS", it would be important to clarify the protocol for increasing the quota. The biggest problem, which may also relate to the TikTok API, is that downloaded data is a snapshot at the time of download. Unlike the previously available Twitter streaming API, it is not a streaming/real-time API allowing researchers to continuously retrieve new data; periodic polling is required, and data may therefore be somewhat out of date, which also limits the kind of research that is possible. Prof.
Bontcheva also presented her experience with other platform APIs, starting with the TikTok Research API, for which their application was successful and the researchers are getting rich data via a comprehensive research API and documentation, despite some of the data quality issues discussed earlier. Bing provides access to thematic datasets for Search, which from a scientist's perspective are very important, as they allow open science and repeatability and mean that different researchers can compare results, including with regard to their AI models. Bing also provides API access for news, web, image, and other search, which is very valuable. Finally, from Google Search it is possible to get access to Google Trends, the Fact Check API and the Ad Library, which have been available for some time. Google Search, however, does present a number of access limitations. In particular, researchers would benefit from free quotas for the Google Custom Search JSON API, which would seem easy to provide, as in the case of YouTube. Further, Google Search thematic collections, comparable to those from Bing, would also be extremely valuable for researchers. This is important because disinformation does not live on one platform, and researchers need to be able to answer cross-platform research questions, e.g., are citizens exposed to disinformation in their search results irrespective of which search engine they use? What is the difference between the search results provided by the different search engines that are signatories to the CoP? At present it is impossible to answer those types of questions, because different search engines are interpreting the idea of research data access in very different ways. Shared research collections across platforms are very important for open science and repeatability and provide insight into the history.
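One reason comparable access matters: even a simple cross-engine comparison, such as measuring how much the result URLs returned by two search engines for the same query overlap, requires equivalent data from both. A hypothetical sketch (the URLs are invented):

```python
def result_overlap(results_a, results_b):
    """Jaccard overlap between two engines' result URL lists for one query:
    shared URLs divided by all distinct URLs returned."""
    a, b = set(results_a), set(results_b)
    return len(a & b) / len(a | b)

# Invented top results for the same query from two hypothetical engines.
engine_a = ["https://example.org/1", "https://example.org/2", "https://example.org/3"]
engine_b = ["https://example.org/2", "https://example.org/3", "https://example.org/4"]

overlap = result_overlap(engine_a, engine_b)  # 2 shared out of 4 distinct URLs
```

The metric itself is trivial; the point is that it cannot be computed at all unless both engines expose result lists in a comparable form.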
Further, while APIs may of course differ across platforms, it is important to have access to the same kinds of data in order to allow for comparability across platforms. The importance of cross-platform research makes it essential for similar types of VLOPs to ideally provide similar types of data via their APIs, allowing researchers to carry out cross-platform research. By way of example, the clean-room approach of Instagram is highly restrictive with respect to repeatability and open science, also in light of the fact that, under the data access provisions as of April 2024, the clean room is cleaned after a certain period of time. Research experiments should be repeatable over time, and the methodology should be documentable. At present, a number of research questions that EU-funded research projects on disinformation aim to investigate (e.g., comparative impact analysis of disinformation videos/images on popular video- and image-sharing platforms) cannot be answered, because the data access provisions of Instagram and TikTok are very different. Further, with the Instagram clean room it is not currently straightforward to import other URLs, e.g., debunked disinformation from other platforms, making potential research highly limited. Prof. Bontcheva also highlighted that in certain European countries the majority of political campaigning continues to take place on X/Twitter. Governments use it as a platform to communicate with the electorate, but it is no longer possible for researchers to study it. In particular, researchers in the UK have received numerous requests from policy makers, regulators, and government bodies for in-depth, large-scale quantitative research on the prominence and impact of disinformation in public political discourse, which can no longer be satisfied as the X/Twitter API is no longer available for free to UK researchers.
Further, data access for researchers from Horizon Europe associated countries such as the UK is currently being restricted by X, which is flatly rejecting all UK applications, claiming that DSA provisions do not extend to UK researchers, even when they are working on EU-funded projects looking into disinformation in the EU. Finally, equality in terms of data access among researchers is essential, and online platforms and search engines should be discouraged from providing data and/or funding to only a small group of researchers (on specific issues or from selected countries), selected by the companies themselves in a non-transparent manner. The importance of collaboration among researchers was further emphasised, as well as the need for a shared understanding among all stakeholders of data access and research and its essential role for society and democratic integrity. This should include positive engagement by policy makers and platforms, beyond current narrow concerns about reputational risk. Lastly, in terms of data access provisions and their monitoring, there is a need for more prominent input from AI/CS experts on the technological and standardisation aspects of data access provisions and their adequacy, which can have huge implications for research.

Meta Content Library and API

Valentin Châtelet from the DFRLab introduced the current data access landscape for Meta. As is well known, Meta is decommissioning CrowdTangle (CT) from 14 August 2024, a tool widely used in the research community, including by the participants in the workshop. In relation to this, the EC has initiated an enquiry into the potential infringement of the DSA by Meta's current data access policy and tools. While none of the workshop participants, including Mr. Châtelet, had access to the Meta Content Library and API, he shared insights from the experience of some of his colleagues at the DFRLab.
Meta also provides researchers access to certain datasets, in particular take-downs and deplatforming related to coordinated inauthentic behaviour (CIB). The procedures to receive access to the Meta Content Library and API were reported by Mr. Châtelet to be very long. Applications are reviewed by the Inter-university Consortium for Political and Social Research (ICPSR) at the University of Michigan. The application includes questions about the researcher's experience of handling confidential data as well as large datasets, and optional documentation from an Ethics Committee or Institutional Review Board (IRB). The Content Library is a graphical user interface (GUI) to query content from Meta, while the API allows querying through code, in either R or Python, of public accounts with more than 25,000 followers, including public Pages, public groups, public events, public profiles, and even comments, according to Meta's documentation. Once researchers gain access, the Content Library interface appears like a search engine, which includes the following sections:
• Influence operations datasets, including CIB and other deplatformed/taken-down content, downloadable as CSVs
• Saved searches: textual search queries that researchers can save for later use and continued monitoring
• Producer lists: collections of pages that populate a collection
• Downloads: the history of CSV files downloaded by the researcher
The Datasets section resembles the collections of CIB datasets from CT, but in terms of labelling it contains significantly less information regarding how the content has been edited (or removed) compared to CT, and explanations of the grounds justifying the existence of each 'dataset' also do not appear to be available.
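Since the influence-operations datasets can be downloaded as CSVs, a typical first step after download is a quick summary of the file. A minimal sketch using only the standard library; the column names and rows are invented, as actual dataset schemas may differ:

```python
import csv
import io
from collections import Counter

# Stand-in for a downloaded takedown-dataset CSV (columns and rows invented).
downloaded_csv = io.StringIO(
    "page_name,post_date,removal_reason\n"
    "Page A,2024-01-04,CIB\n"
    "Page A,2024-01-09,CIB\n"
    "Page B,2024-02-11,CIB\n"
)

# Count removed posts per page as a first look at the dataset.
posts_per_page = Counter(row["page_name"] for row in csv.DictReader(downloaded_csv))
```

In practice, `io.StringIO` would simply be replaced with `open("takedowns.csv")` on the file exported from the Downloads section.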
On the positive side, the CSV files of the datasets (including Meta takedowns and CIB) may be downloaded, as well as trend graphs, which show the evolution of the number of posts found when users query the Content Library's search engine. In terms of limitations, the Meta Content Library does not allow cross-platform searches: each query on the search engine will only search Facebook or Instagram, not both. In this respect the Content Library differs from the Meta Ad Library, which offers the opportunity to search across both platforms. In the search interface, researchers using textual search still have access to several filters that were available on CT. However, while the results of a search can be saved in the personal account, they cannot be downloaded as a CSV file, and it appears to be a less potent search engine, with no cross-referenced search across Facebook and Instagram. Another limitation of the Meta Content Library is that researchers have access to even fewer search operators. At the time of the workshop, the use of quotes to perform literal searches was not available; it has since been implemented, but the search engine still does not support Boolean operators. In addition, handles and URLs of content cannot be used to perform searches, which is a significant setback compared to CT. Finally, there appear to be fewer search and query results than the numbers indicated for each search. The same appears to be true for the Meta Ad Library, where the number of ads indicated is expressed in ranges instead of giving researchers an exact figure. A final significant limitation is that access to the Meta Content Library requires the use of a VPN, casting potential doubt on the monitoring of researchers' activity and data on the platform.
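Until Boolean operators are supported, researchers can approximate them client-side over exported results. A hypothetical sketch of an AND/NOT filter applied to post texts (the sample posts are invented):

```python
posts = [
    "Election claims circulating again",
    "New election fact-check published",
    "Recipe of the day",
]

def matches(text, all_terms=(), none_terms=()):
    """Case-insensitive AND/NOT filter, approximating Boolean search locally:
    every term in `all_terms` must appear, and no term in `none_terms` may."""
    lower = text.lower()
    return all(t in lower for t in all_terms) and not any(t in lower for t in none_terms)

hits = [p for p in posts if matches(p, all_terms=("election",), none_terms=("fact-check",))]
```

The obvious drawback, as with the timestamp filtering discussed for YouTube, is that the broad result set must first be retrieved in full before being narrowed locally.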
DFR Lab observes that the public data available via Meta tools increasingly lacks cross-compatibility with existing Meta products. Regarding the Meta Content Library and API, the unique ID provided when searching for a specific Facebook page is created only within the Meta Content Library API: it cannot be used to search Meta's other technologies, including Facebook or Instagram, and does not provide access to the asset's page via a facebook[.]com/ URL. The community also expressed worry, from a legal perspective, about the contractual terms for accessing Meta research tools, especially for research institutions with fewer resources. Civil society actors such as the media and journalists are not intended to be granted access to the Meta Content Library and API. In addition, there is a requirement to notify Meta of any forthcoming publication, which also raises concerns about the independence of research.2

2 See https://transparency.meta.com/researchtools/product-terms-meta-research last accessed 23 July 2024

The European Digital Media Observatory has received funding from the European Union under contract number LC-01935415

Public Data Access Programs: A First Look
Assessing Researcher Data Access Programs Under The Digital Services Act

Executive Summary

Overview

This report represents an in-depth effort to systematically evaluate the data access programs provided by the major online platforms that are regulated by the EU's Digital Services Act (DSA). We look specifically at access to publicly accessible data.
Data access for researchers, journalists, and NGOs is critical to ensuring that threats on these platforms, for instance to civic discourse or public security, are identified, that the companies are held accountable to the public, and ultimately that the rights of citizens are protected. In this report we provide both a rubric for evaluating these programs based on a detailed set of criteria, as well as a first set of scores using those criteria to produce a moment-in-time scorecard. We rate each platform based on an independent assessment of what the research community believes is needed to further research in the public interest. We examine what data these platforms make available to researchers, the usability of these systems, the accessibility of the programs to researchers, the terms and conditions applied to this access, and the security and privacy provisions they employ. Where possible we have tested and compared these systems against the public data we see on these sites; otherwise we have reviewed public statements, forms, and documentation about each program. We have also reached out to each platform directly to ask clarifying questions and to seek additional information about access. Our grades reflect all of the platform offerings we could identify and examine as of May 2024 and represent a snapshot of the programs. Most of the programs detailed in this report are new, and several changed over the course of our research. We hope and anticipate that these programs will continue to expand and evolve through further use, scrutiny, and guidance by researchers and regulators.

Rankings

Each platform's data access program was assessed using 47 measurement questions grouped into five categories: Quality, Ease of Use, Accessibility, Terms of Use, and Privacy and Security.
The aggregate results rank the platforms by adding the average scores for each category together into one composite score. Based on these findings, we identify key recommendations for regulators, both in the EU and more broadly, seeking to increase the impact of their rules; for researchers exploring how to make use of platform data; and for the platforms themselves to improve their offerings. We anticipate that this process can be repeated periodically to evaluate improvements over time for public interest research.

Key Findings

Platforms have employed four distinct approaches to providing researchers with public data access (and a fifth group provides none)

We observed that platforms have varying approaches to sharing public data with researchers, which can be summarized as:
1. Formal programs: five platforms have created new and specific systems and tooling for researchers investigating systemic risks in the EU.
2. Existing programs and APIs: five platforms provide or have repurposed existing APIs and researcher programs that can be used in the context of DSA-related research.
3. Permission to scrape: several platforms explicitly or implicitly permit data scraping as a means of data access.
4. Data requests: three platforms invite researchers to request specific data sets on an individual basis but do not document what exists or provide a standard mechanism for access.
5. No efforts to provide access to public data: two platforms do not appear to have developed any formal mechanism for researchers to access public data at this time.

Data scraping has emerged as an offering from several prominent platforms

Alphabet (Google Search, Play Store, Shopping, and YouTube) explicitly acknowledges data scraping as a means for enabling researcher access to public data. Booking.com, Amazon, and Pinterest implicitly allow data scraping for non-commercial purposes through their terms of service.
By offering researchers permission to scrape data, these platforms model a valuable approach and normalize the concept, which may increase the value of scraping as a tool for public interest research in the future.

Public data access programs are new, and often lack visibility and documentation

Platforms have only recently introduced new programs or adjusted existing ones. At this time, few researchers have gained access, and fewer still have systematically probed the data to understand the benefits and limitations of the programs. Critically, access to information, data documentation, and technical support all appear to be extremely limited at this time, and in some cases very difficult to find even when they do exist.

We lack shared definitions for "public data" and who should get access to it

There exists no clear or agreed-upon definition of what qualifies as "public." Each platform has made different decisions about what data to share, and in many cases those choices may differ from researcher expectations. In addition, platforms have varying or ambiguous criteria for who is eligible to access data. This lack of clarity and standardization makes it difficult for the platforms to provide the best offerings and for researchers to conduct research.

Rapid response and exploratory research are not readily enabled

A significant portion of the research programs offered at this time require detailed applications and vetting, or specific and time-bound research questions. The response time for access applications is variable, and rarely immediate. As a result, research into real-time emerging threats is limited and challenging at this time. This is significant because a primary use case of public data access is rapid-response and monitoring research, as for example with Meta's CrowdTangle or Twitter/X's API prior to May 2023.
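Where a platform's terms permit scraping, as discussed above, a study would still typically honor the site's robots.txt exclusion rules before collecting pages. The sketch below uses only Python's standard library; the robots.txt content, domain, and "research-bot" user-agent are placeholders, not any real platform's policy.

```python
# Checking robots.txt before scraping, using only the standard library.
from urllib.robotparser import RobotFileParser

# Parse a robots.txt policy (shown inline so the example is self-contained;
# in practice you would fetch it from https://<platform>/robots.txt).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A research crawler identifying itself as "research-bot" (a placeholder
# user-agent) checks each URL before requesting it.
allowed = parser.can_fetch("research-bot", "https://example.com/public/page")
blocked = parser.can_fetch("research-bot", "https://example.com/private/data")
```

Respecting robots.txt does not replace reading the platform's terms of service, but it is an easy, machine-readable baseline for staying within a scraping permission.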
Technical and policy limitations on data access may hinder research quality

Many of the programs impose rate limits on the volume of data that may be accessed, the speed at which data may be collected, and the exact data that may be stored locally by researchers. These limitations are likely to constrain research that aims to define the scale of specific risks on a platform, or that processes and analyzes media posted to a platform.

Topline Recommendations

We have identified 25 recommendations for regulators, platforms, and researchers based on the findings of this research, which are detailed later in this report. Our highest-priority recommendations for regulators and platforms are outlined below:

Regulators should provide clear frameworks for sharing public data

Regulators globally should provide clear and harmonized frameworks for public data sharing between tech companies and public interest researchers. Such frameworks should clarify what constitutes public data for public interest research purposes. They should also ensure safe harbor for independent data collection for public interest research. Currently, the European Union has the most developed legal framework under the Digital Services Act. While some questions related to public data sharing will likely be addressed in the forthcoming Delegated Act on Article 40, the European Commission should provide further guidance in relation to data sharing under Article 40.12. This guidance should, inter alia, clarify what constitutes "publicly accessible data" for these research purposes and set a baseline for the specifications and documentation of data sharing, so that researchers, auditors, and regulators can effectively evaluate these programs. Several major platforms already permit data scraping as a form of data access for research, presumably because it is both easy to implement and demonstrates a clear connection with what is "public".
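Rate limits like those described above are usually handled client-side by pacing requests. Below is a minimal sketch of a throttled collection loop; the limit of 600 requests per minute and the `fetch` callable are hypothetical placeholders for whatever permitted API call or page request a particular study uses.

```python
# Minimal client-side throttle for staying under a documented rate limit
# (e.g. "N requests per minute"). All limits and names are illustrative.
import time

class Throttle:
    """Enforce a minimum interval between successive requests."""
    def __init__(self, max_per_minute):
        self.interval = 60.0 / max_per_minute
        self.last = 0.0

    def wait(self):
        now = time.monotonic()
        sleep_for = self.last + self.interval - now
        if sleep_for > 0:
            time.sleep(sleep_for)   # pause until the interval has elapsed
        self.last = time.monotonic()

throttle = Throttle(max_per_minute=600)   # hypothetical documented limit

def collect(urls, fetch):
    """Fetch each URL in turn, never exceeding the configured rate."""
    results = []
    for url in urls:
        throttle.wait()
        results.append(fetch(url))
    return results
```

Pacing requests this way keeps a longitudinal collection within a platform's stated limits, at the cost of making large-volume studies slow, which is precisely the tension the report describes.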
Critically, guidance should also ensure that researchers physically located outside of the EU will have access to data needed to conduct research related to systemic risks in the European Union.

Platforms should facilitate real or near-real-time monitoring

Exploratory and rapid-response research requires timely and flexible access to public data in response to real-world events. This work is often done by non-academically affiliated researchers, including journalists. Platforms should facilitate the use of public data for real or near-real-time monitoring and ensure that all relevant researchers can conduct this research. Platforms should allow organization-based access to research programs, in addition to project-based access. Interactive dashboards are an ideal mechanism for empowering non-technical researchers, particularly from journalism and civil society, to investigate public data in real time.

Researchers should actively request data and document their experiences

Many of the platforms we evaluated reported to us that they have received very few requests for data. While the "vetted" researcher process under the Digital Services Act is still being built, the public data sharing programs are already in place. Academic researchers, civil society researchers, and journalists need to explore and make data requests in order to expand a shared understanding of what kinds of research can be accomplished with platform data, and to develop a clearer sense of what kinds of data are on the menu for interrogation. Many of these programs are not widely publicized; in some cases we only discovered the existence of a program after multiple efforts to contact a platform. Through various efforts to catalog the programs, including this one, more information is now available to find and apply for access. The experiences with these public data programs will be helpful for better envisioning the non-public, more heavily "vetted" data sharing programs.
In conclusion

This report takes a first look at the public data sharing programs of some of the world's largest tech platforms. It focuses on the new or revised data sharing offerings released in the context of Article 40.12 of the EU Digital Services Act, which requires designated services to share publicly accessible data; but we also consider data sharing more broadly, and so this report is not intended as an assessment of compliance with Article 40.12. We find that reasonable progress has been made to increase transparency and access to public data by some of the most prominent platforms, but that significant work remains to effectively enable research across all the platforms that impact society.

Table of Contents

Executive Summary ............................................ 2
  Overview ................................................... 2
  Rankings ................................................... 3
  Key Findings ............................................... 4
  Topline Recommendations .................................... 5
In conclusion ................................................ 7
Table of Contents ............................................ 8
Background ................................................... 9
  Article 40 of the Digital Services Act ..................... 9
Approach .................................................... 10
  Criteria for Researcher Data Access Evaluation ............ 10
  Detailed Assessment Measures .............................. 11
  Platform Evaluation Strategy .............................. 14
Observations ................................................ 16
  Quality ................................................... 16
  Accessibility ............................................. 20
  Security and Privacy ...................................... 22
Recommendations for public data sharing programs ............ 26
  For Regulators ............................................ 26
  For Platforms ............................................. 26
  For Researchers ........................................... 27
Challenges and Limitations .................................. 28
Conclusion .................................................. 29
Platform Appendix ........................................... 31
  AliExpress ................................................ 32
  Amazon Store .............................................. 33
  Apple's App Store ......................................... 34
  Bing ...................................................... 35
  Booking.com ............................................... 36
  Meta: Facebook & Instagram ................................ 37
  Google Maps ............................................... 38
  Google Play ............................................... 39
  Google Search ............................................. 40
  Google Shopping ........................................... 41
  LinkedIn .................................................. 42
  Pinterest ................................................. 43
  Snapchat .................................................. 44
  TikTok .................................................... 45
  Wikipedia ................................................. 46
  X/Twitter ................................................. 47
  YouTube ................................................... 48
  Zalando ................................................... 49

Background

Article 40 of the Digital Services Act

Under the Digital Services Act (DSA), the European Union initially designated 19 entities as Very Large Online Platforms (VLOPs) or Very Large Online Search Engines (VLOSEs):

● AliExpress
● Amazon Store
● Apple's App Store
● Bing
● Booking.com
● Facebook
● Google Maps
● Google Play
● Google Search
● Google Shopping
● Instagram
● LinkedIn
● Pinterest
● Snapchat
● TikTok
● Wikipedia
● X (formerly Twitter)
● YouTube
● Zalando

Further services have been designated since, but evaluating those additional platforms was beyond the scope of this research.
These VLOP/SEs are subject to certain obligations under Article 40 of the DSA, including that they make public data accessible to researchers who study "systemic risks to the European Union" and are "independent from commercial interests." Section 12 of Article 40 states:

"Providers of very large online platforms or of very large online search engines shall give access without undue delay to data, including, where technically possible, to real-time data, provided that the data is publicly accessible in their online interface by researchers, including those affiliated to not for profit bodies, organisations and associations..."

This research is conducted in the context of Article 40.12 being in force, but looks more broadly at what access to public data is available. This assessment is intended to be understood as an interpretation of the research community's needs, and cannot and should not be read as an assessment of compliance under the Digital Services Act.

Approach

To develop a standardized methodology for evaluating public data access programs, we first identified the primary criteria that are relevant to researchers, in collaboration with a network of stakeholders such as civil society research organizations and independent platform accountability experts. We defined five distinct categories of measures as the criteria employed throughout the evaluation.

Criteria for Researcher Data Access Evaluation

Quality
An evaluation of the quality and comprehensiveness of the data provided, based on a range of criteria derived from real-world data usage for research (see below) as well as a detailed audit of the public data visible on each of the platforms. The grades for quality will help policymakers and the public to understand the breadth of the data platforms are sharing.

Ease of Use
An evaluation of the functionality of the system that provides data, based on real-world researcher practices and concerns.
This set of grades will examine the practical usability of the system and data compared to expectations and best practices. These grades will demonstrate whether or not the tooling and limitations provided by the platforms are sufficient to actually conduct the critical research protected by the DSA.

Accessibility
An evaluation of the processes for gaining access to public data by researchers and other stakeholders. These grades will make clear whether or not the platforms are facilitating broad, diverse, and meaningful access to the community that DSA Article 40.12 is, in our view, designed to serve.

Terms of Use
An evaluation of the terms under which researchers can use the data. This set of grades will look both at how well the terms communicate what researchers can and cannot do, as well as compare those provisions to the practical concerns researchers have when conducting and publishing research.

Privacy and Security
An evaluation of the provisions established by the platforms and the provisions imposed on researchers in order to maintain the security of the data and protect the privacy of individuals who create or are referenced in the data available for research.

Detailed Assessment Measures

For each category, we developed a set of assessment measures in collaboration with a network of stakeholders including civil society researchers, academics, and independent platform accountability experts, who have studied these online platforms and made use of the various methods for accessing and analyzing public platform data. For each measure, we developed a description, guiding questions, and a specific assessment question that was answered for each platform. Each assessment question is designed to be answered affirmatively for a pro-researcher outcome.

Quality
• Coverage Public: Does the platform provide access to the standard public data that is visible to users?
• Coverage Inferred: Does the platform provide access to data that is inferred about or appended to users and content?
• Coverage Aggregate: Does the platform provide access to aggregate data about users and content?
• Media: Does the platform provide access to media content for analysis?
• Recency: Does the platform provide real-time data on activity?
• Historical: Does the platform provide full historical access to its data?
• Granularity: Does the platform provide granular access to individual data attributes?
• Completeness: Does the platform provide access to removed content?
• Consistency: Is the data provided by the platform consistent over time?
• Chronology: Does the data provide time-series chronological information?

Ease of Use
• API Access: Does the platform provide an API to enable researcher access to data?
• Dashboard Access: Does the platform provide an interactive dashboard to enable researcher access to data?
• Download Access: Does the platform provide data via downloadable archives?
• Speed Rate Limiting: Does the platform enable unlimited frequency of requests for querying the data access system?
• Volume Rate Limiting: Does the platform provide unlimited access to the volume of data a researcher can collect and analyze?
• Independent Storage: Does the platform allow the researcher to extract and store the data independently?
• Access Restriction: Does the platform allow the researcher to access the research API / system from anywhere?
• Combinability: Does the platform permit joining data with externally sourced data or tools?
• Search: Does the platform provide keyword search and filtering capabilities?
• Boolean Logic: Does the platform provide the ability to apply boolean logic to search and filter data?
• Documentation: Does the platform provide detailed documentation on the use of the data platform?
• Language Accessibility: Does the platform provide documentation in EU local languages other than English?
• Feedback: Does the platform provide a mechanism for submitting feedback and feature requests for the system?
• Support: Does the platform provide a support service to researchers?

Accessibility
• Eligibility: Are all types of researchers eligible for access to the researcher data offering?
• Exclusivity: Are researchers able to apply without unreasonable demonstration of approvals and credentials?
• Geo-limiting: Are researchers able to apply from anywhere in the world?
• Vetting: Are the researcher vetting requirements fair and appropriate?1
• Data Access Affordability: Is the access to the data provided free of charge?
• Security Requirements: Are the security requirements for data access reasonable?2
• Infrastructure Affordability: Is it affordable to maintain the infrastructure required to work with the data?
• Application Ease: Is the access request process expected to be completed in a reasonable amount of time?
• Rapid-response Feasibility: Can rapid-response research be conducted?
• Collaboration Friendly: Does data access permit data-sharing within and across institutions?
• Simplicity: Is the application process simple, user-friendly, and easy?
• Appeals: Are researchers able to understand and appeal decisions around access?
• Flexibility: Is it possible to revise or extend research use-cases on existing access approvals?
• Longevity: Is the researcher access available for a significant and reasonable period of time?

1 Vetting requirements were judged based on the questions: Are the requirements for approval onerous or limiting in ways that make it difficult to get access? Are IRB approvals required in order to get access?
2 Security requirements were judged based on the questions: Does the platform require levels of security which are not easily maintained by individual researchers? Does the platform require the use of specific tools or processes? Does the platform require processes or procedures which negatively impact the research process?

Terms of Use
• Fairness: Are the terms of use reasonable given the research landscape?
• Clarity: Are the terms of use easy to review and comprehend?
• Autonomy: Do the terms ensure researcher autonomy and independence?
• Technical Limitations: Do the terms ensure there are no significant technical limitations that could impact research?

Privacy and Security
• Personal Data Protection: Does the data access ensure adequate protection of personal data?
• Aggregation Privacy: Are data aggregation limits sufficient to protect against re-identification?
• User Privacy Expectations: Does the data access uphold our inferred3 user expectations about public and private forms of data?
• Data Security: Does the data access system provide adequate security protocols?4
• Oversight: Is there sufficient oversight of researcher activity to ensure compliance with security and privacy protocols?

Platform Evaluation Strategy

To answer the questions, we reviewed documentation available on the VLOPs' and VLOSEs' websites (e.g., terms of use, researcher program descriptions, researcher program applications, API documentation, etc.) and documented our findings in an Airtable database. We referenced existing research into data access programs, including the Platform Transparency Tools Tracker, launched by Anna Lenhart and Annika Springsteel and maintained by the Institute for Data, Democracy & Politics at George Washington University. We also reached out to each VLOP/SE describing our research and requesting additional information.

3 We inferred user expectations about privacy subjectively by contemplating for a given platform surface whether or not we would assume such data would be seen by people other than those we intended. For example, we assume a public TikTok video can be seen by the world, but our comments on a Facebook friend's personal photo would not.
4 Data security was assessed by reviewing the policies described and applied, and where applicable, the technical requirements associated with the data access.

Scoring

We developed a strategy for scoring each platform on each measure using a scale from 0 to 5, based on the relative extent to which they met our criteria.
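The scoring arithmetic (per-measure scores of 0 to 5 averaged within each category, with the five category averages summed into one composite score used for the rankings) can be sketched as follows; the platform scores below are invented for illustration, not taken from the report's results.

```python
# Sketch of the report's scoring arithmetic: each measure is scored 0-5,
# each category contributes the mean of its measures, and the composite
# score is the sum of the five category means. Example scores are invented.
from statistics import mean

def composite_score(category_scores):
    """category_scores: dict mapping category name -> list of 0-5 scores."""
    return sum(mean(scores) for scores in category_scores.values())

example = {
    "Quality": [3, 4, 2, 5],          # mean 3.5
    "Ease of Use": [4, 4],            # mean 4.0
    "Accessibility": [2, 3, 1],       # mean 2.0
    "Terms of Use": [5, 5],           # mean 5.0
    "Privacy and Security": [3, 3],   # mean 3.0
}
total = composite_score(example)      # 3.5 + 4.0 + 2.0 + 5.0 + 3.0 = 17.5
```

Averaging within categories before summing means each category carries equal weight in the composite, regardless of how many measures it contains.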
A score of zero indicates the criterion was not met; a score of five indicates the criterion was fully met. A score of one through four indicates the criterion was partially met. In many cases, platforms score 0 on a measure because they do not offer a program or service, or because it is impossible to measure what they offer at this time. We internally mark scores that are not measurable in order to evaluate averages with and without these measures; however, the final averages and rankings include every score, including those which were "not applicable." Finally, in several cases it was necessary to infer scores from context, where we knew enough about the programs and the data and felt this approach more accurately reflected relative value than an absolute determination based on public information. Thus, on the whole, we try to give a fair representation of the relative value on these measures between platforms, even though the scores were determined subjectively.

Observations

Our assessment of the VLOP/SEs yielded key insights into the different platforms' approaches to making publicly available data accessible to the research community.

Quality

Time series data is missing from virtually all providers

Researchers need access to information about what has happened over time to understand trends and spikes in interest in subjects, and to characterize the growth in reach and influence of problematic actors. Previously, CrowdTangle provided this data about Facebook and Instagram, but the Meta Content Library has lost this feature, and none of the other platforms except Wikipedia maintain such change-over-time data, though Bing's search API appears to provide search volume over time similar to Google Trends data.

Consistency in the data over time is difficult to verify

Researchers need to trust the data they will use when attempting to publish, and almost none of the providers make clear guarantees about the consistency of the data.
Meta has released a new tool called the Content Library, whose documentation specifically acknowledges that the data it returns will not be consistent from day to day or truly representative of what is on the platform for a given search.5

5 https://developers.facebook.com/docs/content-library-and-api/content-library/#additional-information

Ease of Use

There are no significant access restrictions except for Meta

Access restrictions are minimal across the board, except for Meta's offering. Most providers make it easy to connect to the data regardless of where you are and do not require special software or strategies for using the data. Meta's Content Library is a significant exception because it requires using a VPN to connect and applies significant limitations on how data can be collected and analyzed from the API, such as only showing the first 1,000 results in the dashboard and automatically deleting data from the researchers' research tool every 30 days.

APIs are the primary mechanism of data access

APIs are widespread among the highest-ranking providers, and are the ideal mechanism for making data accessible to researchers in a systematic manner, but they require some technical skill to leverage.

Dashboards are an unmet need

Dashboards for public data do not exist for any services except Meta. This kind of accessible exploratory tool is widely used by non-technical and non-academic researchers at newsrooms and NGOs. While such tools are expensive to build and maintain, they would significantly expand usability, especially in support of real-time, rapid-response, and targeted research. A partial exception exists for Wikipedia, because the platform itself is effectively an exploratory dashboard in which all the data and history is explorable on the web.

Customer support for researchers is still in its infancy

Documentation, feedback systems, and technical support are extremely limited overall.
The major providers do offer some of these components, though effectively only in English, and not in the other major languages spoken in the EU.

Rate-limits may significantly impede research
The majority of programs impose significant rate limitations on the volume of data and the speed of access. While the restrictions may be intended to prevent abuse of these systems, the measures are impractical and significantly limit the potential for researchers to carry out larger-scale and longitudinal analyses, which by their nature require large volumes of data in order to identify systemic risks.

Accessibility

Programs are not designed for emerging crises or information shocks
The current structure of the programs does not appear to enable rapid-response research prompted by current events. For example, if a researcher wanted to monitor narratives and trends unfolding in real time around the conflict in Gaza to identify the prevalence of hate speech,6 or to understand the prevalence of foreign disinformation in the conversation about the European farmers’ protests,7 the research application process would significantly hinder rapid-response monitoring on nearly all providers. Users with existing access could leverage that access to pivot to rapid response, but might risk violating agreements that were developed and signed in the context of other specific areas of research.

6 https://www.nytimes.com/2023/11/15/technology/hate-speech-israel-gaza-internet.html
7 https://www.politico.eu/article/europe-farmer-protest-russia-war-propaganda/

Terms of Use

Terms are generally fair
The terms of use for almost all programs – where they exist – are reasonable and straightforward. They do not generally restrict autonomy, with the mild exception of TikTok, which requires pre-publication review.
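Rate limits of the kind described above typically surface to API clients as HTTP 429 responses. The sketch below shows one common client-side mitigation, exponential backoff; the helper name and retry policy are our own illustration, not any platform's documented behavior.

```python
import time

def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call `fetch()` until it returns a (status, body) pair whose status is
    not 429, sleeping base_delay * 2**attempt seconds between retries.
    `fetch` is any zero-argument callable standing in for a real API request;
    production clients should also honour a Retry-After header if one is sent.
    """
    for attempt in range(max_retries):
        status, body = fetch()
        if status != 429:
            return status, body
        sleep(base_delay * (2 ** attempt))  # back off before retrying
    raise RuntimeError("rate limit not lifted after retries")
```

Backoff only smooths over short bursts of throttling; it cannot compensate for quotas that cap the total volume of data, which is why the report treats rate limits as a structural constraint on large-scale research rather than a mere client-side inconvenience.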
Research scale may be compromised by technical limits imposed by the terms of use
Technical limitations specified in the terms that relate to data retention could create various issues that impede research; however, those limitations are driven at least in part by legitimate privacy concerns. The practical challenges created by legal terms that limit the volume, speed of access, and scraping of data can significantly impact the quality and diversity of research that can be produced.

Security and Privacy

Privacy appears to be generally protected
Most of the data access programs appear to maintain our inferred user expectations of privacy, because they grant access only to data that is publicly available. However, those expectations are somewhat ambiguous when it comes to public data about comments, which is made available by YouTube, TikTok, and Meta.

Security is self-reported and potentially performative
While many of the programs require researchers to provide information about data security and related privacy protections, there appears to be almost no mechanism for auditing or enforcement, except with YouTube. In addition, Meta maintains control over the research infrastructure, which largely eliminates the need for verification.

General

Data Access Programs come in four flavors
Platforms have exercised significantly differing strategies in making public data accessible to researchers. As we discuss below, some programs afford researchers significantly more access than others, and in some cases – as with data requests – it’s challenging to assess precisely how expansive access is. Our work identified four types of access:

Formal Programs
These platforms have established new, formal programs to provide researchers with access to APIs, dashboards, and other resources created for the purpose of research under Article 40.12 of the DSA.
Platforms: AliExpress, Google Search, Facebook, Instagram, TikTok, LinkedIn

Existing APIs and Access Programs
These platforms have previously established APIs and data access programs that make public data available. Bing, Wikipedia, X, and YouTube have elected to make these APIs available to eligible researchers under Article 40.12 of the DSA.
Platforms: Bing, Google Maps, Wikipedia, X (Twitter), YouTube

Permission to Scrape
These platforms explicitly or implicitly allow the scraping of public data by researchers. Google Play, Shopping, and Search permit “limited scraping.” The customer terms of service (Section A14. Intellectual Property Rights) for Booking.com state, “You’re not allowed to monitor, copy, scrape/crawl, download, reproduce, or otherwise use anything on our Platform for any commercial purpose without written permission of Booking.com or its licensors.” Given that access to public data under Article 40 of the Digital Services Act requires researchers to be “independent from commercial interests,” it appears that DSA-eligible researchers are permitted to scrape the platform.
Platforms: Amazon Store,8 Google Play, Google Shopping, Booking.com, Pinterest9

8 Amazon has confirmed to our research team via email that “using industry-standard, research-specific tools to collect publicly available data” is not prohibited.
9 Pinterest confirmed via email that they “would allow scraping by a researcher who was qualified under DSA Art. 40(12).”

Data Requests
These platforms provide a way for eligible researchers to make specific requests for public data, which the platform then provides (e.g., in a .csv file). Researchers are not given broad access to public data; rather, they must describe the kind of public data their research requires, and the platform responds directly to the request. These programs currently lack significant documentation about the format in which data will be provided or specifics about what data may be made available.
Platforms: Booking.com, Pinterest, Snapchat

None
Two VLOPs do not appear to have made any effort to provide access to public data. As of the time of writing, we have not been able to identify any formal programs, specifically relevant APIs, or other data offerings from Apple’s App Store10 or Zalando.11
Platforms: Apple App Store, Zalando

10 Apple responded to our inquiry, stating that DSA-eligible researchers can contact the platform to receive publicly accessible data.
11 Zalando’s legal notice page indicates that researchers can contact them for data access, but provides no information about the offering on their website.

Programs are new and untested
Many of the programs and other offerings made available by VLOPs and VLOSEs are new and have not been used extensively by the research community. When asked, at least two platforms – Snapchat and Pinterest – said that no researchers had made use of their programs as of early 2024. While our work sought to test these programs and offerings, we were only able to gain access to a few of them, namely the Meta Content Library, Bing’s researcher program, the YouTube API, and Wikipedia. Furthermore, many researchers have only just begun using the programs, or are in the process of applying to them. These programs have not yet faced extensive scrutiny by researchers. Their true utility will become clearer as a larger diversity of researchers test the data access offerings, using different methods and exploring a variety of questions and problems.

Programs are changing
During the course of our research, many VLOP/SEs stood up new programs and data access offerings, or changed their existing offerings. This necessitated revisiting our grading as programs and applications changed. Our grades reflect platform offerings as of May 2024.

Programs don’t explain what they offer
Apart from the differences between program offerings, VLOP/SEs vary widely in the types of documentation and support they make available to researchers.
In general, programs that offer APIs tend to make more extensive documentation available to researchers describing the types of data available and how to use them. Platforms that require researchers to make specific data requests offer very little information about what data they will share (beyond “public data”) or how the data will be formatted. This results in “cart before the horse” research planning: because researchers cannot know in advance what is available, it is challenging to formulate precise research questions before applying for data access, which may discourage researchers from using the programs. There is currently no formal definition of “publicly accessible data,” which allows platforms to remain ambiguous about how they define the “public data” they offer.

Programs lack tools for those without technical skills
Few platforms offer dashboards or other tools that would allow a researcher without technical skills to view or analyze data. Meta, a notable exception, introduced the Meta Content Library and API in 2023, which include tools to search and view data via keywords and “producer” lists. In 2024, Meta added the ability for researchers to download a selection of publicly available data from sources with significant public followings (pages with more than 15,000 likes and profiles with more than 25,000 followers).

Loss of CrowdTangle
CrowdTangle, a research platform from Meta that will be discontinued as of August 2024, has served as an example of a relatively robust data access tool usable extensively by technical and non-technical researchers alike. Unfortunately, the Meta Content Library (MCL) does not yet offer the same functionality, and in some cases the same level of granularity, as CrowdTangle.
While Meta’s Content Library demonstrates the most extensive effort to provide researcher-specific tooling, and adds several new types of data that were previously unavailable to researchers in CrowdTangle, the system lacks features and data that were previously available.

Recommendations for public data sharing programs

For Regulators
1. Regulators should offer a formal definition of “public data” in regulatory guidance that platforms, researchers, and regulators can reference.
2. Regulators should require platforms to publish specifications and documentation for their researcher access programs.
3. Regulators should expand researcher access needs to support exploratory research, without a pre-defined research question or hypothesis, in relation to systemic risks.
4. Regulators should allow organization-based access to research programs, rather than project-based access, to enable researchers to continually monitor and probe data for new questions and problems.
5. Regulators should require platforms to capture and publish the number of researcher data access requests they receive and the number they approve, and to provide guidance on their standards for approval.
6. Regulators should establish a way of including the behavior of algorithmic recommender systems in the concept of public data, so that researchers can request access to the aggregate outputs of algorithmic recommendation systems to understand which public data users see.
7. Regulators should clarify the legitimacy of scraping as a means of accessing public data. Several major platforms already permit, either implicitly or explicitly, data scraping as a form of data access for research purposes. Implicit permission to scrape, as we have received for instance over email from some providers, is likely not sufficient reassurance for most researchers. Legal ambiguities remain, and privacy concerns in the practice of scraping are therefore not systematically addressed.
8. Regulators should provide funding to support and encourage researchers’ initial exploration and testing of these programs.

For Platforms
9. Platforms should publish more extensive documentation about what data is available to researchers and how that data will be provided.
10. Platforms should provide some way of auditing what is “publicly available” within a given platform – by distributing data schemas, surface audits, or other means – so that researchers and regulators can have a clear and complete understanding of what the typical user can see on the platform.
11. Platforms should create offerings for non-technical researchers, such as dashboards or downloadable datasets for non-programmatic analysis.
12. Platforms should enable and encourage the creation of third-party tooling such as dashboards, data-donation repositories, and historical archives to increase accessibility and capacity for collaborative research.
13. Platforms should provide easy-to-find or direct links to their DSA-specific researcher data access programs.
14. Platforms should maintain and enable access to time-series data about content engagement, account growth, and any other relevant attributes that change over time and are necessary for monitoring systemic risks.
15. Platforms should develop mechanisms for documenting and validating the consistency and accuracy of data over time.
16. Platforms should develop more robust researcher resources for both applying for access and using data. This includes documenting the anticipated wait time for applications, providing examples of data privacy and security protocols to comply with, and providing technical customer support and feedback tools for the data access tools themselves.
17. Platforms should increase rate-limits and/or provide mechanisms for securing exceptions to the limits for specific use-cases.
18. Platforms should create multi-language documentation.
19. Platforms should create a path for rapid-response access during moments that demand action.
20. Platforms should develop and document processes for reviewing and auditing researchers’ security and privacy protocols in ways that are material and verifiable, but do not feel disproportionately burdensome.

For Researchers
21. Researchers should apply to and use these public data programs to improve understanding of offerings, document and report on the challenges of applying for access and using data, and participate in field-wide surveys to increase shared awareness.
22. Researchers should share best practices for meeting privacy and security requirements for data access, to reduce the burden on researchers who are not familiar with these protocols or lack the capacity to develop them.
23. Researchers should develop a shared public list of researcher data access use-cases that represent edge-cases not currently supported by platform programs.
24. Researchers should identify and document time-sensitive, rapid-response moments that would benefit from researcher data access but are not possible under the current programs.
25. Non-academically affiliated and non-EU-based researchers should partner with EU-based academic institutions to pursue research related to systemic risks, to exemplify the value of global research collaboration, especially for those in Global Majority contexts.

Challenges and Limitations

We wish to acknowledge some significant challenges we faced in our effort to develop this scorecard. Understanding these limitations is necessary context for how the current set of scores can be used responsibly, and should also guide future efforts to expand this work.

Program Access
We were not able to access most of the programs themselves because of the various limitations imposed by the applications for access, including geography, specific research requirements associated with systemic risks, academic credentials, and technical specifications for data security and privacy.
This means that our scores are based primarily on documentation, public statements, feedback from researchers who currently possess or have applied for access, and in some cases, inferences based on general knowledge of the programs and platforms. As of May 1, 2024, few other researchers had secured access to these programs, which has limited our ability to get detailed information about their experiences with usability, completeness, consistency, the timing of application reviews, and other factors. We anticipate that forthcoming research, for example as led by the Coalition for Independent Technology Research, will provide a much larger sample of researchers’ experiences that will shed light on these issues.

Documentation
The current researcher access programs generally suffer from limited or non-existent documentation. In addition, for many of these programs, the information and application forms that do exist are buried and difficult to find. As a result, it is impossible to verify whether what we found is comprehensive and accurate. In many cases, the information about researcher data access programs linked to the DSA obligations lives on separate sites without good SEO or links to the top level of the terms, trust and safety, or other transparency resources. After we developed draft evaluations and shared the scores for comment, more platforms did respond to provide additional information and context.

Timing
Over the course of this period, several programs and application forms emerged that were not available when we first began. In addition, programs changed over the course of this research, offering new information or announcing enhanced features. The new releases are often undated, making it difficult to determine when they were released and/or changed. We have made our best effort to revise our grades and collect additional evidence, but because these programs are actively under development, some findings may be outdated upon publication.
Platform Distinctions
The 19 VLOP/SEs vary dramatically from one another. Several are retail marketplaces for physical and virtual goods and services (Amazon, Apple App Store, Booking, Google Play, Zalando); others are mainstream social media platforms (Facebook, Instagram, TikTok, X, YouTube); some are more private social networks (LinkedIn, Snapchat); and others are more complex services (Bing, Google Maps, Google Search). As a result, it is difficult to evaluate what constitutes “public” data on the platforms, and to compare their offerings to one another. While this baseline is a meaningful start, future work in this space should likely separate these distinct types of platforms from one another for analysis.

Lessons Learned
We developed our initial set of measures and categories of evaluation criteria with care, and with feedback from the community of stakeholders who will use this data access. However, after attempting to evaluate the platforms on these measures, it became clear that although we wish to know the answers to these questions, it is not feasible to identify those answers. This was true for various measures, but particularly the Security and Privacy questions, as well as areas where we started from a hypothetical expectation for data access, including the Coverage of Inferred Data, Rapid Response Feasibility, and Aggregate Data.

Broad Scope
There are some key aspects of these platforms that are beyond the scope of this methodology. Algorithmic recommender systems are not explicitly mentioned under DSA Article 40.12, but represent a possible factor contributing to systemic risks, for instance to civic discourse, electoral processes, or public security. These features straddle the line between what can be considered “public data” and proprietary systems.
The results of algorithmic recommenders are frequently visible as public data, but their risk and impact can only be evaluated in aggregate (e.g., how many people are recommended toxic content based on behavior). Currently, these platforms do not provide researchers access to information about algorithmic recommenders, so this research does not attempt to address this critical area. This will likely be an area for DSA “vetted researchers” to interrogate through Article 40.4 requests, which allow researchers meeting certain qualifications to apply for access to non-public data sets. Access to public data via Article 40.12 could be an important exploratory mechanism to help formulate stronger Article 40.4 requests.

Conclusion

This research has sought to take a first look at many of the new or revised public data sharing mechanisms provided by large online platforms in the context of the Digital Services Act. The DSA is establishing a novel structure for data sharing between platforms and vetted researchers under its Article 40. However, it also establishes a requirement for public data sharing under Article 40.12. In this research we consider public data sharing by 19 platforms. We developed an assessment rubric to score each program based on criteria developed in consultation with partners. In our 47 criteria, we try to reflect the research needs of the wider public interest research community that currently makes, or would likely make, use of this public data. We developed these criteria in the absence of key regulatory concepts, such as a definition of public data. Our scores cannot and should not be interpreted as assessing compliance under the Digital Services Act. We acknowledge that our scores represent a snapshot of these programs at a moment in time. Indeed, we hope this report will nourish the continued development of these programs. To that end, we provide our rubric for evaluating these programs based on our detailed criteria, along with this first set of scores.
We also offer 25 specific recommendations to regulators, platforms, and researchers.

Platform Appendix

Detailed Scores Breakdown
For a complete list of each platform and their respective scores on each measure, we have produced a separate document that can be accessed here to review, compare, and download the data. These scores are accurate as of May 1st, 2024 and do not reflect any changes by the platforms since that date.

AliExpress
Overview: AliExpress has established the Open Research and Transparency portal, which allows approved researchers to request access to publicly available data. AliExpress offers “privacy protected datasets and other statistical data” as well as the Dataworks API through an “internal network-based controlled access environment that does not allow data to be exported.” Researchers affiliated with academic institutions and nonprofit organizations in the EU can apply for access.
What we looked at:
• Researcher access overview site
• Application form
Resource Links:
• EU Digital Services Act
• Open Research & Transparency
• AliExpress Open Research & Transparency: Application for access to publicly accessible information by researchers
Outreach:
• We reached out to AliExpress for comment and discussion, but did not receive a response.

Amazon Store
Overview: Amazon Store does not offer a formal program for data access at this time. However, a spokesperson for Amazon Store responded to our inquiry via email: “We can confirm that our Conditions of Use do not prevent researchers who comply with the conditions of the DSA from using industry-standard, research-specific tools to collect publicly available data from our EU Store, for the purposes of Article 40(12) of the DSA.” The feasibility of collecting data in this way remains ambiguous, because of the potential for anti-scraping systems to inhibit this activity in the real world.
What we looked at:
• Website terms and conditions
Resource Links:
• Conditions of Use & Sale
Outreach:
• Amazon EU responded to our request for comment and discussion, stating that the EU store Conditions of Use permit DSA-eligible researchers to use “industry-standard, research-specific tools” for public data collection.

Apple’s App Store
Overview: Apple’s App Store does not offer a formal program for data access at this time. When reached for comment, a member of Apple’s DSA Compliance team stated that the company has “procedures in place for researchers to request App Store data that is publicly available, in accordance with Article 40(12) of the DSA” and that researchers “interested in obtaining publicly accessible data may contact Apple DSA Compliance.”
What we looked at:
• Digital Services Act compliance page
Resources:
• European Digital Services Act (DSA)
Outreach:
• We reached out to Apple for comment and discussion. We received a response from Apple, which detailed the above.

Bing
Overview: Bing has established the Qualified Researcher Program to offer researchers access to public data. Researchers free from commercial interests can apply from anywhere in the world. The program makes available the Bing Search API, Bing Webmaster Tools, and datasets to approved researchers.
What we looked at:
• Digital Services Act compliance page
• Researcher access overview site
• Application form
• Tool documentation
Resources:
• EU Digital Services Act information
• Bing Qualified Researcher Program
• Bing Qualified Researcher Program Application
• Bing Research Resources
• Bing Web Search API
• Bing Webmaster Tools
Outreach:
• We reached out to Bing for comment and discussion. We applied for and received access to Bing’s program.

Booking.com
Overview: DSA-eligible researchers can apply for access to public data via Booking.com’s Researchers Data Request Portal, though it is not clear in what format the data will be provided.
In addition, the customer terms of service (Section A14. Intellectual Property Rights) for Booking.com state, “You’re not allowed to monitor, copy, scrape/crawl, download, reproduce, or otherwise use anything on our Platform for any commercial purpose without written permission of Booking.com or its licensors.” Given that access to public data under Article 40.12 of the Digital Services Act requires researchers to be “independent from commercial interests,” it appears that DSA-eligible researchers are likely permitted to scrape the platform.
What we looked at:
• Website terms and conditions
• Digital Services Act compliance page
Resource links:
• Digital Services Act
• DSA Researchers Data Request Portal
• Booking.com Researcher Data Use Policy
• Customer terms of service
Outreach:
• We reached out to Booking.com for comment and discussion, but did not receive a response in time to incorporate into our report. We received an automated response with an application to determine eligibility for data access.

Meta: Facebook & Instagram
Overview: Meta has established the Meta Content Library and Content Library API – a new formal program offering researchers access to public data on Facebook and Instagram. The MCL offers a dashboard interface for users to query, sort, and filter content. The Content Library API enables researchers to query public data programmatically and analyze it in a clean room. The application process and MCL access are managed by the Inter-university Consortium for Political and Social Research (ICPSR) at the University of Michigan. Researchers can apply from outside of the EU. Meta’s existing program, CrowdTangle, is set to be discontinued in August 2024.
What we looked at:
• Researcher access overview site
• Application form and resources
• Tool documentation
• Terms and conditions
Resource links:
• Research tools: Meta Content Library and API
• Meta for Developers: Meta Content Library and API
• SOMAR InfoReady Application Guide
• Other research tools and datasets
• Product Terms for Meta Research Tools
• Frequently asked questions
Outreach:
• We reached out to Meta for comment and discussion, and met with members of the research partnerships and policy teams.

Google Maps
Overview: Alphabet has established a formal program and application process for accessing public data. To access Maps data, Alphabet offers “access to public data through a cloud-based solution.” Only researchers in the EU are eligible for access to the Google Researcher Program.
What we looked at:
• Researcher access overview site
• Application form
• Tool documentation
Resource links:
• Google Researcher Program
• Google Researcher Program Application
• Google Transparency Center: Researcher Engagement
• Google Researcher Program Acceptable Use Policy
• Google Maps for developers
Outreach:
• We reached out to contacts at Alphabet for comment and discussion, but ultimately only met with a representative from YouTube’s policy team.

Google Play
Overview: Alphabet has established a formal program and application process for accessing public data. To access Play data, Alphabet offers “permission for limited scraping.” Only researchers in the EU are eligible for access to the Google Researcher Program.
What we looked at:
• Researcher access overview site
• Application form
Resource links:
• Google Researcher Program
• Google Researcher Program Application
• Google Transparency Center: Researcher Engagement
• Google Researcher Program Acceptable Use Policy
Outreach:
• We reached out to contacts at Alphabet for comment and discussion, but ultimately only met with a representative from YouTube’s policy team.

Google Search
Overview: Alphabet has established a formal program and application process for accessing public data. To access Search data, Alphabet offers an “API for limited scraping with a budget for quota.” Only researchers in the EU are eligible for access to the Google Researcher Program.
What we looked at:
• Researcher access overview site
• Application form
Resource links:
• Google Researcher Program
• Google Researcher Program Application
• Google Transparency Center: Researcher Engagement
• Google Researcher Program Acceptable Use Policy
Outreach:
• We reached out to contacts at Alphabet for comment and discussion, but ultimately only met with a representative from YouTube’s policy team.

Google Shopping
Overview: Alphabet has established a formal program and application process for accessing public data. To access Shopping data, Alphabet offers “permission for limited scraping.” Only researchers in the EU are eligible for access to the Google Researcher Program.
What we looked at:
• Researcher access overview site
• Application form
Resource links:
• Google Researcher Program
• Google Researcher Program Application
• Google Transparency Center: Researcher Engagement
• Google Researcher Program Acceptable Use Policy
Outreach:
• We reached out to contacts at Alphabet for comment and discussion, but ultimately only met with a representative from YouTube’s policy team.

LinkedIn
Overview: LinkedIn has established a Researcher Access Program and application process for accessing public data.
Global researchers can apply for access to the platform’s public data, but the format in which the data is provided is unclear.
What we looked at:
• Researcher access overview site
• Application form
• Terms and conditions
Resource Links:
• LinkedIn: Researcher access
• LinkedIn Researcher Access Program Application
• Additional Terms for the LinkedIn Research Tools Program
Outreach:
• We reached out to LinkedIn for comment and discussion, and received a response from a member of the platform’s legal team. We requested access to LinkedIn’s program, but were denied.

Pinterest
Overview: Pinterest has established an application form through which global researchers can request access to the platform’s public data. It is not clear what format the data is provided in.
What we looked at:
• Digital Services Act compliance page
• Application form
Resource links:
• Digital Services Act
• Researchers intake form
Outreach:
• We reached out to Pinterest for comment and discussion. We met with a member of the platform’s legal team, who confirmed researchers can request approval to automatically collect publicly accessible data.

Snapchat
Overview: Snapchat has established guidelines and an email contact through which global researchers can request access to the platform’s public data. According to Snapchat, data will be provided to the researcher in a .csv file.
What we looked at:
• Digital Services Act compliance page
• Researcher access overview site
Resource links:
• Privacy and Security: European Digital Services Act (DSA)
• Researcher and Data Access Instructions
Outreach:
• We reached out to Snapchat for comment and discussion, and met with members of the platform’s legal and regulatory teams.

TikTok
Overview: TikTok offers a Research API that allows access to publicly available data related to accounts and content on the platform. U.S. and EU researchers affiliated with non-profit academic and research institutions are eligible to apply.
What we looked at:
● Researcher access overview site
● Application form
● Tool documentation

Resource links:
● Expanding TikTok's Research API and Commercial Content Library
● Research API
● About Research API
● Research API Getting Started
● Research API FAQ
● Research API codebook
● Research API Terms

Outreach:
● We reached out to TikTok for comment and discussion, but did not receive a response.

Wikipedia

Overview: The Wikimedia Foundation has made its public data available in a variety of ways since before the enactment of the DSA. Researchers do not need to request or apply for access to the data; they can use existing tools to programmatically collect and analyze it.

What we looked at:
● Tool documentation
● Researcher access overview site
● Terms and conditions

Resource links:
● Wikipedia is now a Very Large Online Platform (VLOP) under new European Union rules: Here's what that means for Wikimedians and readers
● Wikipedia API Parsed Infobox: Introducing Structured Contents
● Research:Data
● Wikimedia Research
● Wikimedia Foundation Open Access Policy
● Wikipedia Page History
● Terms: Content Licensing
● Wikimedia Downloads
● Research Data FAQs

Outreach:
● We reached out to Wikipedia for comment and discussion and met with members of their legal and global advocacy teams.

X/Twitter

Overview: X has an application through which DSA-eligible researchers can request access to public data via the X API. This pre-existing API is available to other researchers for a fee.
What we looked at:
● Researcher access overview site
● Application form
● Terms and conditions
● Digital Services Act compliance page
● Tool documentation

Resource links:
● Country-specific Resources: European Union
● Developer Terms of Service
● X Developer Platform: Academic research
● X DSA Researcher Application
● Dev community: academics
● Difference between commercial and noncommercial usage

Outreach:
● We reached out to X for comment and discussion, but did not receive a response.

YouTube

Overview: YouTube has established a Researcher Program to grant academic researchers access to public data on the platform. The program gives access to YouTube's Data API after approval of the researcher's application; it also allows limited scraping for DSA-eligible researchers.

What we looked at:
● Researcher access overview site
● Application form
● Terms and conditions
● Tool documentation

Resource links:
● YouTube Program Policies
● YouTube Researcher Program Application
● YouTube Program Terms and Conditions
● YouTube Data API Overview
● YouTube API Services - Developer Policies
● YouTube API Services Terms of Service

Outreach:
● We reached out to contacts at Alphabet for comment and discussion and met with a representative from YouTube's policy team.

Zalando

Overview: Zalando states that researchers can contact the platform through the DSA Single Point of Contact to receive data "in accordance with Article 40 DSA."

What we looked at:
● Terms and conditions

Resource links:
● Zalando and the Digital Services Act
● Zalando files legal action against the European Commission to contest its designation as a "Very Large Online Platform" as defined by the Digital Services Act
● Legal Notice

Outreach:
● We reached out to Zalando for comment and discussion, but did not receive a response.
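Among the services reviewed, Wikipedia is the only one whose public data can be collected programmatically without an application process. As an illustration of what that kind of open access looks like in practice, the sketch below builds a query URL for the long-standing MediaWiki Action API; the example article title and the specific revision fields requested are illustrative assumptions, not part of this report.

```python
# Minimal sketch: constructing a MediaWiki Action API request for
# public revision metadata. No credentials or application are needed;
# the article title here is an illustrative example.
from urllib.parse import urlencode

MEDIAWIKI_API = "https://en.wikipedia.org/w/api.php"

def build_revisions_query(title: str, limit: int = 5) -> str:
    """Return an Action API URL that fetches recent revision metadata
    (timestamp, user, edit comment) for a publicly accessible article."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "revisions",
        "titles": title,
        "rvprop": "timestamp|user|comment",
        "rvlimit": limit,
    }
    return f"{MEDIAWIKI_API}?{urlencode(params)}"

url = build_revisions_query("Digital Services Act")
# The URL can then be fetched with any HTTP client; responses are JSON.
```

The same endpoint supports many other query modules (page content, links, categories), and Wikimedia also publishes bulk dumps for researchers who need complete datasets rather than per-page queries.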
This report was produced by:
Cameron Hickey
Kaitlyn Dowling
Isabella Navia
Claire Pershan

Design by:
Shannon Zepeda and Tess Heinricks

With thanks to:
Becca Ricks
Nicholas Piachaud
Julian Jaursch
Martin Degeling
Brandon Silverman
Svea Windwehr
Anna Lenhart
Henry Tuck

Also a special thanks to the participants of Mozilla's workshop held in April 2023 and everyone who contributed to and endorsed the subsequent recommendations for public data sharing for public interest research.