AI Magazine May 2024

AI AND BIG DATA

Couldwell’s ‘Day 1’ data problems explained

“To get started around RAG, you have to look at what data you have, what formats it exists in and how to get this ready for use with Generative AI. This can be unstructured data or structured data that exists in a variety of formats. All this data you have will then be turned into document objects that contain both text and associated metadata. This data is then split into smaller portions called chunks that can be indexed and understood.
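The document-and-chunk step described above can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular product: the `Document` class, the `split_into_chunks` helper, the word-based splitting and the chunk sizes are all assumptions chosen for clarity.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """A document object: raw text plus associated metadata (hypothetical shape)."""
    text: str
    metadata: dict = field(default_factory=dict)


def split_into_chunks(doc: Document, chunk_size: int = 40, overlap: int = 10) -> list[Document]:
    """Split a document into overlapping, word-based chunks.

    chunk_size and overlap are counted in words and are illustrative defaults.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = doc.text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + chunk_size]
        # Each chunk inherits the parent's metadata and records where it began.
        meta = {**doc.metadata, "chunk_start_word": start}
        chunks.append(Document(" ".join(piece), meta))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks
```

Overlapping the chunks slightly is one common way to avoid cutting a sentence off from its context; real pipelines often split on sentences or tokens rather than plain words.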

“The chunks are indexed and converted into vector embeddings, which capture the semantic relationships between concepts or objects in mathematical form. These vectors are stored in a vector database for future use.

“Each of these steps is needed to get your data ready to search when a query comes in from a user. That query gets turned into a vector, which then gets used for a search comparison against all the information held in the vector database. The system finds the information that has the closest semantic match to the query, and then shares that information back to the LLM to build the response.
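The query-time flow can be sketched as follows, under some loud assumptions: the three-dimensional vectors are hand-written for illustration (a real system would obtain them from an embedding model), cosine similarity is used as the closeness measure, and `top_k` and `build_prompt` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vector, index, k=2):
    """Return the k (score, text) pairs whose vectors are closest to the query."""
    scored = [(cosine_similarity(query_vector, vec), text) for vec, text in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_prompt(question, retrieved):
    """Assemble the retrieved chunks into context for the LLM."""
    context = "\n".join(f"- {text}" for _, text in retrieved)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Illustrative, hand-written vectors; not the output of any real model.
index = [
    ([0.9, 0.1, 0.0], "Red Nike trainers, sizes 6 to 12"),
    ([0.1, 0.9, 0.0], "Blue waterproof hiking jacket"),
    ([0.8, 0.2, 0.1], "Crimson running shoes"),
]
query_vector = [1.0, 0.0, 0.0]  # pretend embedding of the user's query
results = top_k(query_vector, index, k=2)
prompt = build_prompt("Which red trainers do you stock?", results)
```

Here the two footwear entries score highest and are passed to the LLM as context, while the jacket is left out. A production vector database would typically use an approximate nearest-neighbour index rather than scoring every record.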

“As an example, say you ask a retailer for product information around “scarlet Nike sneakers” - a traditional search engine would look for exact matches to that term, while a vector search would understand that “scarlet” is a synonym for “red”, that “Nike” is a brand name, and “sneakers” equates to “trainers” or

