95

I write dataset instead of data set, in the same way I write database instead of data base.

Looking at some English dictionaries, I don't find dataset.

Does that mean dataset isn't correct and I should use data set?

apaderno
  • 59,185
  • 3
    *dataset* for certain datasets; *data set* for any set of data in general. In specific contexts, a dataset needs to satisfy conditions to qualify as a dataset. Any set of any data can be called a data set, unqualified. – Kris Jan 12 '12 at 10:27
  • 5
    Googling the NIPS website, which hosts many academic papers that use datasets, I find that "data set" returns 1,890 results and "dataset" 2,660. The same pattern holds for the plurals (datasets/data sets). I would suggest using "dataset". – Finn Årup Nielsen Nov 13 '15 at 14:05
  • @FinnÅrupNielsen How exactly are you googling the NIPS website for this check? – alper Aug 09 '22 at 11:38
  • @alper I am googling, e.g., "data sets" site:nips.cc – Finn Årup Nielsen Aug 11 '22 at 09:57

5 Answers

44

As @mmyers notes, dataset does not appear in any dictionaries. However, there are 172 instances in the Corpus of Contemporary American English, and all but a handful are in the “academic” section, representing formal academic writing. Its absence from dictionaries is probably because it is a fairly new coinage: the only two examples in the Corpus of Historical American English are from 2001, with nothing before then. Interestingly, the British National Corpus has 51 instances, dating from the 1980s to the mid-1990s.

nohat
  • 68,560
  • 20
    There is an entry now in the Cambridge English dictionary https://dictionary.cambridge.org/dictionary/english/dataset – Daniel May 12 '20 at 13:27
35

Wiktionary says they are equivalent, but neither Merriam-Webster nor Dictionary.com has an entry.

Given that information, I guess I would classify dataset as technical jargon, but it's really not much of a jargon term. Any technical audience would have no problem with it; a non-technical audience should still easily understand its meaning.

mmyers
  • 6,181
  • 22
    Is it possible that data set is written dataset by analogy with database? Has database ever been written as data base? – apaderno Aug 28 '10 at 21:55
  • 24
    @kiamlaluno: Yes, indeed. Database books from the 1980s and back used to spell it "data base" all the time. – CesarGon Jan 15 '11 at 18:42
  • 4
    So we should use dataset, then, since the evolution of the word database suggests that is where usage is heading. – adam Nov 07 '17 at 13:24
  • 2
    Today there is an entry in Dictionary.com for Data set. – Stanfrancisco Mar 20 '19 at 17:01
  • 13 years later, this seems prescient and the Merriam-Webster entry suggests 'dataset' is now more common. – sage Sep 01 '23 at 19:19
5

The APA Style Blog comes down firmly on the data set spelling. Although dataset is understandable, the two-word form still seems to be preferred even in academic settings.

apaderno
  • 59,185
Quantum7
  • 183
4

As new tech terms appear and evolve, they tend to be spelled separately at first, and over time become more closely joined, either with a hyphen or no space at all. This evolution is occurring with data set (dataset) at the moment. According to Google’s Ngram viewer, data set was dominant from the outset until 2013, and authoritative sources like dictionaries and style guides (e.g. APA) reflect this, with some like Wikipedia giving dataset as an alternative (not incorrect) form.

However, in the past 15 years the trend has sharply reversed, and now Google’s corpus shows that dataset is almost twice as popular in British and global usage; Cambridge now gives only this form.

Data set was still holding strong in American usage until very recently (see Ngram and change the corpus to American English to see the graph). Dataset is poised to equal or surpass data set in popularity soon, even in the States. Dictionaries are slow to reflect such changes; the AHD (American Heritage Dictionary) still gives only data set, while Merriam-Webster gives neither.

You can expect this shift toward dataset to be reflected in dictionaries and style guides within the next five to ten years, and I expect that they will vary in which is given as the main entry, but most should soon recognize both forms.

Both are common enough that, in my opinion, you may choose freely between them, as long as you use one consistently within any particular document. There is certainly no difference in clarity. However, in the prenominal adjective position, as in dataset development, it seems logical to me that if your preferred style is data set, you would hyphenate (data-set development), and if you prefer dataset, you would write dataset development.

KillingTime
  • 6,206
1

"Dataset" is a word, just not a common one.

I consulted the OneLook Dictionary to find which dictionaries list "dataset". Try it yourself to see the results. For a common word, OneLook will find entries in 24 general dictionaries, plus various specialized dictionaries.

In this case, OneLook finds "dataset" in three general dictionaries (Wordnik, Wiktionary, Wikipedia), but also in: a computing dictionary, a medical dictionary, a meteorology dictionary, and a search engine dictionary.

GEdgar
  • 25,177
  • 1
    Did you click through to each dictionary? When I tried to replicate this, I found that the first dictionary I checked did not have an entry for "dataset" but rather it was a redirect to "data set". Another dictionary link led to a 404. The "search engine dictionary" link didn't even go to a dictionary entry but rather was a generic search box that when used gave me a list of two ads. – Laurel Aug 11 '22 at 16:54
  • @Laurel This is a good comment. – GEdgar Aug 11 '22 at 16:57