Top AI dataset pulls data from BitcoinTalk, Steemit, and U.S. SEC

by Jeremy

Colossal Clean Crawled Corpus (C4), an AI dataset used by major tech companies, contains data from numerous crypto-related websites.

C4 dataset draws from crypto sites

The Washington Post and the Allen Institute for AI recently analyzed the C4 dataset, ranking websites by the number of “tokens” or text snippets taken from each source.

The U.S. Securities and Exchange Commission, whose website partially contains content on cryptocurrency regulation, was among the dataset’s largest sources. Its website (sec.gov) ranked at #39 and accounted for 36 million, or 0.02%, of C4’s tokens.

Bitcointalk.org, a blockchain discussion board created by Satoshi Nakamoto, ranked at #780. It accounted for 6.1 million, or 0.004%, of C4’s tokens.
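As a rough cross-check of the figures above, the token counts and percentage shares can be used to estimate the overall size of C4. The sketch below is illustrative only: the function name is hypothetical, and it assumes the token counts are exact and that each percentage was rounded to its last printed digit, neither of which the analysis confirms.

```python
def implied_total_range(tokens, share_pct, rounding_step):
    """Return (low, high) dataset sizes consistent with a rounded percentage share.

    Assumes `tokens` is exact and `share_pct` was rounded to the nearest
    `rounding_step` (both are assumptions, not stated in the article).
    """
    half = rounding_step / 2
    low = tokens / ((share_pct + half) / 100)   # largest plausible share -> smallest total
    high = tokens / ((share_pct - half) / 100)  # smallest plausible share -> largest total
    return low, high


# Figures quoted in the article for sec.gov and bitcointalk.org.
sec = implied_total_range(36_000_000, 0.02, 0.01)
bitcointalk = implied_total_range(6_100_000, 0.004, 0.001)

# The two estimates are mutually consistent if their ranges overlap.
overlap = (max(sec[0], bitcointalk[0]), min(sec[1], bitcointalk[1]))

for name, (low, high) in [("sec.gov", sec), ("bitcointalk.org", bitcointalk), ("overlap", overlap)]:
    print(f"{name}: {low / 1e9:.0f}B to {high / 1e9:.0f}B tokens")
```

If the overlap range is non-empty, the two quoted shares agree on the dataset's total size to within rounding error.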

Cryptocurrency news and aggregation sites such as Cointelegraph and Coinmarketcap.com were also represented. Eight such sites collectively accounted for at least 0.008% of C4’s tokens, though other sites likely increase the true total.

Websites related to specific cryptocurrencies and exchanges were also represented in the dataset but accounted for a negligible number of tokens.

Two crypto-adjacent sites also ranked highly. IPFS (ipfs.io) ranked at #16 while Steemit (steemit.com) ranked at #594. The first site is a distributed network from the blockchain firm Protocol Labs, while the second makes direct use of blockchain. However, these sites don’t necessarily contain content related to cryptocurrency.

Mainstream sites topped the list

The C4 dataset is used in AI language models from major tech companies including Google’s T5 and Facebook’s LLaMA, according to the Washington Post.

Though the above sites are among C4’s most significant crypto-related websites, they’re outranked by mainstream websites and news sources, which often cover cryptocurrency topics and are likely the primary source for all crypto-related data.

C4 has also been criticized for containing hate speech and pirated data. Though the dataset’s name suggests that it has been “cleaned,” its assemblers only used a list of 400 words to censor specific content, meaning that controversial content remains intact.

The presence of crypto sites, as well as the presence of controversial data, could affect the level of bias seen in content produced by AI chatbots.
