Author: Thomas Padilla

  • Releasing The Public Interest Corpus Principles and Goals

    “Captain” Mary Converse, instructing V-7 (candidates for United States Navy ensign commissions) students in use of sextant, compass and gyroscope and in navigation

    Today, we are pleased to release The Public Interest Corpus Principles and Goals. This release builds on the recap of our final planning workshop and anticipates release of our final deliverable later this month.

    Early on in The Public Interest Corpus planning process, we were encouraged by our advisory board to dedicate significant effort to the development of multi-stakeholder informed Public Interest Corpus principles and goals. The principles and goals are intended to support collective decision making as The Public Interest Corpus moves from a planning phase to an implementation phase.

    Over the course of the year we iteratively developed the principles and goals with feedback from a diverse set of stakeholders (e.g., researchers, authors, librarians, publishers, technologists) across the United States. Hundreds of contributions later we believe we have landed on a set of principles and goals that align interests and lay a path for delivering concrete value to the communities that libraries aim to serve. Special thanks to The Public Interest Corpus contributors and the advisory board.


    Principles and Goals

    The Public Interest Corpus works with a growing coalition of stakeholders to develop a service that advances the library community’s ability to support the responsible use of their collections for AI research and development and computational research more generally. The initial focus of the service is on a corpus development, discovery, and access solution for books data (digitized and/or born digital text with metadata) at scale. Some estimates suggest that ~162,000,000 books have been created globally, with ~2,200,000 new books published each year. Collectively, libraries steward the most comprehensive source of human inquiry recorded in book form. 

    The Public Interest Corpus is inspired by open corpus development efforts from organizations like Wikipedia and PLEIAS. The Public Interest Corpus is also encouraged by efforts like the Institutional Data Initiative and European Books Data Commons. The Public Interest Corpus builds on these efforts by working to provide access to in-copyright as well as public domain books data on terms that are legal and ethical.

    Academic researchers in particular have a pressing need to gain access to books data at scale, but they face numerous challenges. To start, accessing in-copyright books data for AI development is extraordinarily expensive. The public interest is not well-served by barriers that de facto restrict books data access to the wealthiest for-profit technology companies. Furthermore, in-copyright books data is typically excluded from broader use on an overly narrow licensing basis, weakening transparency and public trust in AI-driven research. Compounding the difficulty, researchers leveraging computational methods face a piecemeal data access ecosystem, where access to data must be pursued across numerous sources with variable pricing, policies, and licenses. Given significant administrative and financial barriers, researchers working within and outside of higher education find themselves pulled toward data that is of less than optimal quality, exhibits significant biases, lacks comprehensiveness, and is made available in a manner that is both legally and ethically problematic.

    AI researchers and developers must also contend with the fact that many books – public domain and in-copyright – are simply not digitized. Without digitization, the potential of one of humanity’s most comprehensive knowledge bases remains out of reach for AI research and development. Digitization of book collections at scale will require significant, ongoing public and private financial investment. It is essential that contracts supporting public and private digitization partnerships contain terms that safeguard the ability of libraries to combine, enhance, and provide access to data produced through these partnerships.

    In order to support research, teaching, learning, and new forms of creativity, books data should be made available to researchers in accordance with the law and normative community expectations such as promoting author attribution and working to ensure that data bias is well documented. Making books data available at scale strengthens the ability for researchers to leverage AI and other computational methods to meet public interest challenges like fighting misinformation, strengthening understanding of the past and present, and fostering an informed citizenry. It also enables the development of more focused corpora that support fine-tuning existing models and/or development of small models for tailored use cases. 

    The Public Interest Corpus depends on key partnerships with libraries, publishers, researchers, and minoritized communities represented in collections that form the corpus. In working with The Public Interest Corpus, libraries advance their research support mission by combining collections from many organizations in order to produce the most comprehensive, high quality corpora for research use; publishers will advance their mission by multiplying the impact of works created by authors they support; researchers will guide corpora development to ensure that corpora are optimally usable; and communities represented in collections will help ensure that corpora are curated and provisioned in a responsible manner – i.e., documenting the presence of books that contain outdated, stolen, and/or harmful knowledge about minoritized communities. 

    What principles guide The Public Interest Corpus? 

    1. The Public Interest Corpus … advances equitable access to books data for small, medium, and large organizations.
    2. The Public Interest Corpus … supports AI research and development and computational research that addresses public interest challenges (e.g., fighting misinformation, advancing understanding of the past and present, fostering a more informed citizenry).
    3. The Public Interest Corpus … addresses corpus limitations (e.g., linguistic bias, outmoded forms of knowledge present in the corpus, and data quality) through production of additional metadata in line with efforts like the Hugging Face Model Card and Data Nutrition Label.
    4. The Public Interest Corpus … commits to transparency with respect to corpus composition, modification, and agreements in order to increase public trust in research that makes use of the corpus.
    5. The Public Interest Corpus … values the labor of content creators and works to ensure that their work is recognized through promotion of credit and attribution practices.
    6. The Public Interest Corpus … adopts practices and infrastructure that aim to reduce the environmental impact of corpus development, discovery, and access.
    7. The Public Interest Corpus … forms partnerships that concretely address long-term collective needs of academic libraries and the communities they serve (e.g., maximizing access, reducing legal encumbrances).
    8. The Public Interest Corpus … is fundamentally guided by diverse stakeholders including but not limited to researchers, librarians, publishers, authors, and technologists.

    What goals should The Public Interest Corpus work to achieve?

    1. Coordinate books data sourcing, discovery, and access across small, medium, and large organizations. 
    2. Create cost efficiencies in access to books data. 
    3. Minimize legal risk for those that seek to provide or make use of books data. 
    4. Curate and provide access to fit-for-purpose books data that exceeds in quality and comprehensiveness what is otherwise available. 
    5. Ensure consistent corpus growth and refinement over time in alignment with user community needs. 
    6. Identify and adopt scalable author credit and attribution methods for authors and rights holders to track reuse. 
    7. Deliver minimum viable solutions.
    8. Adopt a fit-for-purpose governance model.
    9. Develop a sustainability model that reduces barriers to books data access for small, medium, and large organizations on an ongoing basis. 

  • The Public Interest Corpus Update – Oakland Edition

    Center for Library & Instructional Computing Services, Undergraduate Library, 1986

    The Public Interest Corpus recently completed the last of three planning workshops. The final workshop was hosted at the University of California Office of the President in Oakland, CA, and built on findings from prior workshops held at Northeastern University and New York University Law School. A diverse group of stakeholders helped sharpen The Public Interest Corpus implementation plan by contributing expert insights on the following topics: (1) Users, Uses, and Managing Legal Risk, (2) Data Development and Access, (3) Multi-stakeholder Governance, and (4) Sustainability.

    Users, Uses, and Managing Legal Risk 

    We began the day with a presentation and discussion of proposed Public Interest Corpus users, anticipated types of uses, and organizational approaches to managing legal risk. The discussion first addressed which users to prioritize for this corpus. It drew on insights gained from the planning process, as well as observations about the evolving legal and risk environment (particularly, outcomes in cases such as Bartz v. Anthropic and Kadrey v. Meta) and the growing disparity in access to books for AI applications in the commercial sector as compared to academic and research settings. On that basis, the project team recommended that the implementation phase for The Public Interest Corpus should primarily serve academic users by providing full-text access to open and in-copyright books data for AI training and computational research more generally. In concert with this recommendation, the project team introduced practical measures that could help organizations manage legal risk associated with academic research use of The Public Interest Corpus.

    Some takeaways from this workshop discussion:

    • Developing an effective data sharing agreement is key. An effective data sharing agreement must address factors including but not limited to (1) striking a balance between supporting ideal research practices (e.g., reproducible research) and managing legal risk (e.g., prohibitions on in-copyright data sharing), (2) clearly addressing issues pertaining to downstream use (e.g., data use vs. multi-sector use of models developed from those data), and (3) ensuring that the agreement is designed in such a way that it can be readily adopted by research organizations with variable appetite for legal risk.
    • An implementation phase must center researchers in service development. Various Public Interest Corpus services, such as additional metadata creation to accompany data releases (e.g., data bias, data quality) and the means to evaluate, select, and securely access data, need to be tested hand in hand with researchers.

    Data Development and Access 

    Following the Users, Uses, and Managing Legal Risk session, the project team sought feedback on proposed Public Interest Corpus services that could add value to books data (e.g., curation, transformation) as well as the technical means to provide secure full-text access to books data. 

    Some preliminary takeaways from this workshop discussion:

    • The Public Interest Corpus should pursue multiple strategies that encourage book data attribution. Attribution is key to author credit and the integrity of research produced using Public Interest Corpus data. Participants discussed a range of attribution options from simple readme files to more granular forms of attribution.
    • The Public Interest Corpus should provide additional metadata with data releases that accounts for data limitations (e.g., linguistic bias, outmoded forms of knowledge present in the corpus), data transformations, and data quality (e.g., OCR quality). Researchers emphasized the need for this metadata as it helps evaluate the research potential of the data. A variety of models exist to guide the creation of additional contextual metadata, such as Hugging Face’s model card and the Data Nutrition Label.
    • The Public Interest Corpus should work with research libraries to assess and plan for how to support researcher use of the Public Interest Corpus. With The Public Interest Corpus primarily focused on data development, access, and use, it is essential to work with research libraries to establish handoffs between Public Interest Corpus services and research library services. 
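
    As one illustration of the contextual metadata discussed in this session, a data release could ship with a small machine-readable record alongside the text. The sketch below is hypothetical: the `ReleaseCard` class, its field names, and the sample values are assumptions made for illustration, and it does not reproduce a Public Interest Corpus, Hugging Face, or Data Nutrition Label schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ReleaseCard:
    """Hypothetical contextual metadata accompanying one data release."""

    title: str
    known_limitations: List[str] = field(default_factory=list)
    transformations: List[str] = field(default_factory=list)
    # Average OCR confidence between 0.0 and 1.0; None if it was not measured.
    mean_ocr_confidence: Optional[float] = None

    def to_json(self) -> str:
        """Serialize the card so it can be published next to the data."""
        return json.dumps(asdict(self), indent=2)


# Illustrative values only; they do not describe a real release.
card = ReleaseCard(
    title="Example release",
    known_limitations=["linguistic bias", "outmoded forms of knowledge"],
    transformations=["OCR", "metadata normalization"],
    mean_ocr_confidence=0.93,
)
print(card.to_json())
```

    A record like this would let a researcher check limitations, transformations, and OCR quality before requesting access to the underlying data.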

    Multi-Stakeholder Governance 

    Time and again, project stakeholders have emphasized the importance of governance. The pace of change is rapid in this space and the complexity of coordinating effort across multiple roles and sectors requires a thoughtful and effective approach to governance. 

    Some preliminary takeaways from this workshop discussion:

    Governance should provide a level playing field for stakeholders to guide The Public Interest Corpus. Given a commitment to advancing the public interest, governance opportunities must provide equitable opportunities for a diverse range of organizations to guide strategy.  

    Governance opportunities as well as advisory opportunities should be provided to stakeholders. As with any well-structured community effort, The Public Interest Corpus will have needs best served by governance and other needs best served by stakeholders operating in an advisory capacity. The Public Interest Corpus should provide both opportunities for engagement. 

    Sustainability 

    In the closing session, we asked workshop participants for feedback on a Public Interest Corpus sustainability model. It was our sense, given past experience combined with an assessment of the financial health of the higher education sector and disruptions to the Federal and private funding environment, that the Public Interest Corpus must diversify funding streams in order to achieve sustainability. 

    Some preliminary takeaways from this workshop discussion:

    Moving from a startup phase to a sustaining phase is likely a 4-5 year effort. Diversification of funding is key to the startup phase, requiring significant up-front investment. Over time it will be essential to reduce reliance on initially diversified funding sources (e.g., Federal funding, private funding, commercial partnerships) by transitioning to a funding model that is majority funded by Public Interest Corpus member contributions. 

    Encouraging broader use of the AI infrastructure. In addition to the training corpus and related services, the workshop also explored using the same infrastructure with the Model Context Protocol (MCP) as another front end that would reach broader audiences and serve other kinds of uses. MCP is an open standard that enables LLMs to connect with tools and resources on demand to help answer queries and to run analyses. It could allow for smaller and more nimble uses of the metadata and full text of books within AI environments, such as commercial chatbots and open source AI tools run by universities and researchers.
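
    To make the MCP idea more concrete: MCP messages are JSON-RPC 2.0, and a server advertises tools through `tools/list` and runs them through `tools/call`. The sketch below imitates that exchange in plain Python. It is a simplified illustration, not the official MCP SDK and not a Public Interest Corpus design: it omits the initialization handshake and the transport layer, and the `search_books` tool and the in-memory `CATALOG` are hypothetical.

```python
import json

# Hypothetical in-memory catalog standing in for corpus metadata.
CATALOG = [
    {"title": "Example Book One", "year": 1901},
    {"title": "Example Book Two", "year": 1955},
]

# Tool description in the shape returned by "tools/list".
TOOLS = [{
    "name": "search_books",  # hypothetical tool name
    "description": "Find catalog records whose title contains a query string.",
    "inputSchema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}]


def search_books(query: str) -> list:
    """Case-insensitive substring match on titles."""
    return [book for book in CATALOG if query.lower() in book["title"].lower()]


def error(request_id, code: int, message: str) -> dict:
    return {"jsonrpc": "2.0", "id": request_id,
            "error": {"code": code, "message": message}}


def handle(request: dict) -> dict:
    """Answer one JSON-RPC 2.0 request in the style of MCP tool messages."""
    request_id = request.get("id")
    method = request.get("method")
    if method == "tools/list":
        result = {"tools": TOOLS}
    elif method == "tools/call":
        params = request.get("params", {})
        if params.get("name") != "search_books":
            return error(request_id, -32602, "Unknown tool")
        hits = search_books(params["arguments"]["query"])
        result = {"content": [{"type": "text", "text": json.dumps(hits)}],
                  "isError": False}
    else:
        return error(request_id, -32601, "Method not found")
    return {"jsonrpc": "2.0", "id": request_id, "result": result}


response = handle({
    "jsonrpc": "2.0", "id": 1, "method": "tools/call",
    "params": {"name": "search_books", "arguments": {"query": "two"}},
})
print(json.dumps(response, indent=2))
```

    In this arrangement an AI client would first list the available tools and then call one on demand, which is the kind of smaller, more nimble use of book metadata described above.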

    Commercial partnerships must align with The Public Interest Corpus principles and goals. The Public Interest Corpus principles and goals were developed with iterative feedback from community stakeholders throughout the planning process. The principles and goals are intended to ensure that The Public Interest Corpus maintains its commitment to the public interest. 

    Moving Forward 

    In December, we will release (1) the final version of The Public Interest Corpus principles and goals and (2) our core deliverable – lessons learned from the planning phase and the direction we believe The Public Interest Corpus should take as it moves toward an implementation phase.

  • The Public Interest Corpus Update – NYC Edition

    NYU Law School Workshop Participants

    Last month, a diverse set of stakeholders gathered at New York University Law School to contribute to an implementation plan for The Public Interest Corpus. This workshop built upon the first project workshop held at Northeastern University Libraries in the spring through (1) continued refinement of project principles and goals, (2) documentation of research and library service use cases, and (3) collective ideation on prospective year 1-3 and year 4-6 activities for an implemented version of The Public Interest Corpus.

    Continued Refinement of Principles and Goals

    As in the Northeastern University workshop, we began the day with an exercise focused on refining The Public Interest Corpus principles and goals. Participants contributed a broad range of comments, edits, and suggestions that greatly strengthened project principles and goals. 

    Some preliminary takeaways:

    • The Public Interest Corpus should advance an equitable sustainability model that is responsive to variation in resources of potential supporting organizations.  Participants emphasized that a public interest resource must have a sustainability model that enables small, medium, and large organizations to sustainably offer benefits to their communities. 
    • Commercial sector digitization contracts should not inhibit library ability to provide access to and support the use of data produced through digitization partnerships. There was broad recognition that commercial partners would continue to be fundamental to library mass digitization efforts. As libraries maintain and/or enter into new digitization partnerships they should work to remove clauses including but not limited to data embargoes and clauses that are sufficiently vague so as to cast doubt on allowable uses of data. 
    • The Public Interest Corpus should concretely encourage values-aligned, downstream use. Examples of measures that concretely encourage values-aligned use include but are not limited to providing data citation user education, platform features that automatically generate data citations, and/or data sharing agreements that require data citation in context of a range of research & development scenarios – e.g., a published paper, a generative AI application, and so on. 

    Documenting Research and Service Use Cases

    Significant effort was dedicated to documenting research and service use cases. A research use case exercise elicited common challenges that disciplinary researchers encounter seeking to gain access to and make use of data. A service use case exercise elicited common challenges that libraries encounter seeking to provide services in this space. 

    Some preliminary takeaways: 

    Research Use Cases

    • The Public Interest Corpus should develop and provide access to corpora that align with multiple notions of comprehensiveness. Researchers noted that determining collection comprehensiveness was context-dependent. In some cases researchers may be looking for as many books as possible to satisfy their definition of comprehensiveness and in other cases they may be looking for a highly curated set of books that correspond to a specific theme. Given the context-dependent nature of comprehensiveness, The Public Interest Corpus should work with user communities to prioritize the creation of corpora at varying scales for specific purposes. 
    • “Upstream” risk aversion creates an attritional process for “downstream” users. Researchers noted multiple instances where organizational risk aversion relative to computational uses of digitized and/or born digital collections creates a prolonged, drawn-out process for accessing and making use of library collections. Researchers hope that The Public Interest Corpus can create an environment where there is less perceived or real risk for organizations and smoother access to collections for end users.
    • Researchers want to contribute enhanced collections data back to libraries. In many cases researchers are re-OCRing received collections data and/or taking steps to improve metadata (e.g., normalization, enrichment) but have no easy way to contribute enhanced data back to the library for the benefit of other researchers. Researchers are motivated to see that their work on data enhancement benefits a broader community. 

    Service Use Cases 

    • AI capacity is growing in libraries, but it remains to be seen what the library community can achieve together at the level of infrastructure, data, and services. A number of participants noted investments in local AI capacity but acknowledged that little of this capacity has been joined at the community level. Participants reflected on existing community investments in efforts like HathiTrust and expressed interest in determining optimal levels of multi-organizational collaboration on something like The Public Interest Corpus.
    • Justifying potential investment in a community solution remains challenging, though not insurmountable. The Public Interest Corpus should continue developing a value proposition that concretely makes the case for how addressing the stated challenge at scale most effectively meets local research needs – e.g., comprehensive, high quality corpora are by necessity the product of combining collections from multiple organizations.

    Envisioning the Future of The Public Interest Corpus

    In remaining workshop activities participants ideated on year 1-3 and year 4-6 activities for an implemented version of The Public Interest Corpus. 

    Some preliminary takeaways:

    To centralize or decentralize? Or both? As discussions turned toward technical implementation there was substantial discussion of centralization, decentralization, or some combination of both to make The Public Interest Corpus work. Various technical approaches were discussed including but not limited to MCP, vector stores, and APIs feeding a central repository. 

    Balance quantity and quality in corpus development. Participants noted the importance of both quantity and quality in prioritizing future corpus development. In some cases quality could outweigh quantity, in other cases quantity could outweigh quality, and in other cases a balance could be struck. The Public Interest Corpus should work with a range of stakeholders at varying degrees of intensity to achieve the right balance in corpus development and releases.

    Plan for future collection scope expansion. Multiple participants suggested that The Public Interest Corpus should plan for expansion of collection scope beyond books data moving forward. Collection scope expansion could include archives and special collections materials. Participants expressed that these collections were of high value and not commonly available in existing training data. They also emphasized the need for deep investigation of potential ethical issues with these materials in the event that collection scope expands. 

    Next Steps 

    We offer thanks to our brilliant workshop participants – they trekked through a very hot and humid NYC to help plan for The Public Interest Corpus and somehow maintained good spirits throughout the day! 

    We have one final workshop this October in Oakland, CA. If you work in the region and are interested in potentially attending please let us know here. If you would simply like to learn more about the project and/or discuss possible collaboration please let us know here.  

    The project team plans to share The Public Interest Corpus Startup Plan by December 2025. 

  • Building The Public Interest Corpus for AI and Computational Research

    Last week, EDUCAUSE published an interview with Dan Cohen and me focused on “Building The Public Interest Corpus for AI and Computational Research”. We appreciate the opportunity to introduce our project to the EDUCAUSE community and look forward to future collaboration down the line.

    We are holding one final workshop for the planning phase of this effort in Oakland, CA this October. If you are interested in contributing to the development of The Public Interest Corpus please let us know.

  • The Public Interest Corpus Update – Boston Edition 

    On March 3, librarians, authors, publishers, and technologists gathered at Northeastern University Library in Boston to contribute to a startup plan for The Public Interest Corpus. The Public Interest Corpus is focused on supporting the creation of high-quality AI training data from memory organizations (e.g., libraries, archives, museums) and their partners (e.g., publishers) that advance the public interest. For too long, access to high-quality training data has been limited to the world’s most well-resourced organizations, pushing others toward data of lesser quality and comprehensiveness and of unsettled legality. Over the course of our day together, event participants made strong contributions to the development of The Public Interest Corpus startup plan.

    Refining Principles and Goals

    We began the day with an exercise focused on refining The Public Interest Corpus principles and goals. We felt this was a good place to start given that principles and goals held in common provide the foundation for collective action. Participants contributed a broad range of comments, edits, and suggestions that greatly strengthened project principles and goals. The project team is in the process of versioning this document and will have more to share down the line. 

    In the interim, we share some takeaways:

    • The Public Interest Corpus should maximize transparency. Participants called for transparency around corpus composition and emphasized how this could support reproducible research and the development of AI that is pluralistic and grounded in particular social contexts. 
    • “Public Interest” is a compelling framing that needs to be made more concrete. Participants expressed a need for more specificity regarding target user communities served by The Public Interest Corpus and strategies that effectively balance public interests and commercial interests in The Public Interest Corpus. 

    Workshopping Core Challenges

    Following the principles and goals activity, we broke participants into mixed stakeholder (author, publisher, legal expert, librarian, technical expert) groups. Group composition was shuffled once more in the afternoon to encourage continued novelty in ideation. Each group was presented with a set of questions to respond to that aligned with the following challenge areas: (1) Target Audiences, Training Data Needs, Potential Partnerships, (2) Legal and Policy, and (3) Business Model, Sustainability, and Governance. As with the principles and goals exercise, the project team is actively processing the product of group activity and will have more to share in the future. 

    In the interim, we share some takeaways: 

    Target Audiences, Training Data Needs, Potential Partnerships

    • Accessing in-copyright books for AI training purposes is extremely difficult for researchers. Challenges include but are not limited to the downstream impact of contractual override, organizational uncertainty in making fair use determinations, and multiple active court cases in the United States testing arguments that AI training is a fair use.
    • Focus on simple solutions. Multiple participants suggested that focusing on a simple solution was the best path forward – i.e., identify the most compelling minimum viable product and deliver on it. A solution could become more complex over time through phased development informed by user community studies. 

    Legal and Policy Challenges

    • Balancing what is legally permissible vs. meeting normative community expectations is essential. While it may be the case that AI training on in-copyright works is a fair use, this does not mean that a proposed solution should make works available for training without author or publisher engagement. This effort can learn from engagement with author communities to assess their views and preferences regarding the use of their work for AI training purposes. Authors Alliance continually engages with authors on this issue. 
    • Pending court cases do not pose an insurmountable barrier to a solution. Though active copyright AI litigation is likely to continue for many years, participants believe there is a range of strategies that can be pursued to mitigate legal risk and support development of a solution that advances the public interest.

    Business Model, Sustainability, and Governance 

    • The Public Interest Corpus should develop multiple prospective business models and test for viability with stakeholders. Participants have indicated that it would be useful for stakeholders to engage with a range of business models with different revenue streams – e.g., membership model, philanthropically supported, commercially supported, hybrid, etc. Participants suggested paths forward that led to creation of a standalone organization or integration within an existing organization. 
    • With an eye toward mission and policy alignment, business models should differentiate between noncommercial and commercial use of The Public Interest Corpus. Potential service costs should be responsive to resource disparities between non-commercial and commercial users. Potential service costs should also be responsive to resource variation within a prospective non-commercial user base. 

    Next Steps 

    We plan to continue engaging core challenges with stakeholders at our next workshop, to be held July 2025 in New York City. If you work in the region and are interested in potentially attending please let us know here.

    In addition to the July 2025 workshop, we will present on the project and/or hold additional working events across North America. The next presentation will be at the Coalition for Networked Information meeting in Milwaukee in April. To keep track of future community engagements please refer to our engagements page. 

  • The Public Interest Corpus: An Update and Opportunities for Co-Development 

    A Library salute to National Photography Month and the photographer’s skill for staging eye-catching compositions

    In December 2024 we announced a new project to develop a public interest AI training corpus focused on books. Over the last few months we’ve been actively engaging a diverse set of stakeholders in the development of The Public Interest Corpus. 

    The Public Interest Corpus is focused on developing large-scale, high-quality AI training data from the world’s memory organizations that serve the public interest. In the aggregate, memory organizations like libraries and archives are in a prime position to address this need given a multi-century focus on developing high-quality, locally and globally comprehensive collections of books, newspapers, scholarly journals, photographs, manuscript materials, and more. We seek to prioritize uses of The Public Interest Corpus that promote learning, access to knowledge, and broad benefits to the public. 

    Project Team and Advisory Board

    The project team consists of Dave Hansen, Executive Director of Authors Alliance, and Dan Cohen, Vice Provost for Information Collaboration, Dean of the Library, and Professor of History at Northeastern University. In January, I joined the team as the Public Interest AI Strategist. In this capacity I will leverage extensive experience developing community around responsible computational use of memory organization collections as data and responsible AI. Giulia Taurino recently joined the team as Project Coordinator. Giulia holds a doctoral degree in Media Studies and Visual Arts from the University of Bologna and the University of Montreal and is currently a member of the NULab for Digital Humanities and Computational Social Science and of the AI & Arts interest group at The Alan Turing Institute.

    The project team is guided by a strong advisory board composed of senior leaders and experts who think deeply about how authors, libraries, and AI can better serve the public interest. 

    • David Bamman, Associate Professor, UC Berkeley School of Information
    • Sandra Aya Enimil, Director of Scholarly Communications and Collection Strategy, Yale University Library
    • Mike Furlough, Executive Director, HathiTrust
    • David Smith, Associate Professor, Khoury College of Computer Sciences, Northeastern University
    • Claire Stewart, Dean of Libraries and University Librarian, University of Illinois, Urbana-Champaign 
    • Mehtab Khan, Assistant Professor of Law, Cleveland State University College of Law
    • Rachael Samberg, Director, Scholarly Communications and Information Policy, UC Berkeley Library
    • Robin Sloan, NY Times best-selling science fiction author
    • Günter Waibel, Associate Vice Provost & Executive Director, California Digital Library
    • Martha Whitehead, Vice President for the Harvard Library and University Librarian, Harvard University
    • John Wilkin, CEO, LYRASIS
    • Suzanne Wones, University Librarian, UC Berkeley Library
    • Ted Underwood, Professor of Information Science and English, University of Illinois, Urbana-Champaign

    How you can get involved 

    Over the next year the project team will engage a diverse set of stakeholders in a co-development process that directly informs The Public Interest Corpus priorities, strategies, and partnerships. To kick things off we are holding a working event at Northeastern University Library in Boston, Massachusetts on March 3 where a group of senior library administrators, publishers, disciplinary researchers, authors, and technical experts will workshop core legal, technical, business model, and governance challenges. 

    Moving forward we intend to hold additional focused in-person and virtual working events with a broad range of communities. We strongly believe that engaging with diverse stakeholders in a co-development process for this effort will be key to success. If you are interested in participating in a future event, hosting a Public Interest Corpus event, or have other ideas for how we might collaborate please let us know via the following form.

    We look forward to advancing a public interest solution with you all.