Institut für Informatik, Universität Freiburg

 Book-Crossing Dataset ... mined by Cai-Nicolas Ziegler, DBIS Freiburg

Collected by Cai-Nicolas Ziegler in a 4-week crawl (August / September 2004) from the Book-Crossing community with kind permission from Ron Hornbaker, CTO of Humankind Systems. Contains 278,858 users (anonymized but with demographic information) providing 1,149,780 ratings (explicit / implicit) about 271,379 books.

[ ! ] Freely available for research use when acknowledged with the following reference (further details on the dataset are given in this publication):

As a courtesy, if you use the data, I would appreciate knowing your name, what research group you are in, and the publications that may result.


The Book-Crossing dataset comprises 3 tables.
  • BX-Users
    Contains the users. Note that user IDs (`User-ID`) have been anonymized and mapped to integers. Demographic data (`Location`, `Age`) is provided if available; otherwise, these fields contain NULL values.

  • BX-Books
    Books are identified by their respective ISBN. Invalid ISBNs have already been removed from the dataset. Moreover, some content-based information is given (`Book-Title`, `Book-Author`, `Year-Of-Publication`, `Publisher`), obtained from Amazon Web Services. Note that in case of several authors, only the first is provided. URLs linking to cover images are also given, appearing in three different flavours (`Image-URL-S`, `Image-URL-M`, `Image-URL-L`), i.e., small, medium, large. These URLs point to the Amazon web site.

  • BX-Book-Ratings
    Contains the book rating information. Ratings (`Book-Rating`) are either explicit, expressed on a scale from 1-10 (higher values denoting higher appreciation), or implicit, expressed by 0.
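The explicit / implicit distinction above matters in practice, since averaging over all rows would treat every implicit 0 as a very low score. The following is a minimal sketch, not part of the original distribution, of how the two kinds of rating could be separated. The function name `split_ratings` and the sample rows are hypothetical and made up for illustration; only the 0 versus 1-10 convention comes from the description above.

```python
def split_ratings(rows):
    """Split (user_id, isbn, rating) tuples into explicit and implicit lists.

    Assumes the convention described for BX-Book-Ratings:
    0 marks an implicit rating, 1..10 an explicit one.
    """
    explicit, implicit = [], []
    for user_id, isbn, rating in rows:
        if rating == 0:
            # Implicit feedback carries no score, so only the pair is kept.
            implicit.append((user_id, isbn))
        elif 1 <= rating <= 10:
            explicit.append((user_id, isbn, rating))
        else:
            raise ValueError(f"rating out of range: {rating!r}")
    return explicit, implicit


# Made-up sample rows (placeholder ISBN strings, not taken from the dataset).
sample = [
    (1, "isbn-a", 0),
    (1, "isbn-b", 5),
    (2, "isbn-a", 10),
    (3, "isbn-c", 0),
]

explicit, implicit = split_ratings(sample)

# The mean is computed over explicit ratings only.
mean_explicit = sum(r for _, _, r in explicit) / len(explicit)
print(len(explicit), len(implicit), mean_explicit)
```

Keeping implicit ratings as plain (user, book) pairs leaves it open whether they are later discarded or used as unary feedback.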


Several data format flavours are available. Note that all downloads are in .ZIP format.
  • SQL dump. Contains both schema information and data insertion statements, which makes it the more convenient option. Run it as an SQL script.

  • Data as comma-separated values (CSV). The first line contains column names. Fields are separated by semicolons, and all entries are enclosed in quotes.
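As a rough illustration of the CSV layout just described (header line, semicolon separators, quoted entries), the sketch below parses a small in-memory sample with Python's standard `csv` module. The sample rows are invented. That a missing value appears as the literal text `NULL` is an assumption based on the BX-Users description, and `read_users` is a hypothetical helper name. For a real file, open it with whatever encoding the download actually uses, which this page does not state.

```python
import csv
import io

# Invented sample in the described layout: header line first,
# semicolon-separated, every entry quoted.
SAMPLE = (
    '"User-ID";"Location";"Age"\n'
    '"1";"some town, some state, some country";"NULL"\n'
    '"2";"other town, other state, other country";"34"\n'
)


def read_users(stream):
    """Parse BX-Users-style rows from a text stream into dictionaries."""
    reader = csv.DictReader(stream, delimiter=";", quotechar='"')
    users = []
    for row in reader:
        age = row["Age"]
        users.append({
            "user_id": int(row["User-ID"]),
            "location": row["Location"],
            # Assumption: a missing age is the literal text NULL (or empty).
            "age": None if age in ("NULL", "") else int(age),
        })
    return users


users = read_users(io.StringIO(SAMPLE))
for user in users:
    print(user)
```

Because the delimiter is a semicolon, commas inside `Location` do not split the field.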

Other Datasets

  • Offers collaborative filtering (CF) datasets for movies. MovieLens datasets come in different sizes. Also links to the older EachMovie dataset, which can be obtained upon request from Compaq.

  • Dense dataset for joke recommendation. It has a large number of users but only a small number of items (around 100 jokes).
