How machine learning and AI can be used in e-Discovery

Core to many court cases is a timeline of events and actions backed up by evidence. What many attorneys need is a precise and relevant virtual timeline built from the digital footprints that everyone now leaves. In the past this work was done by a small army of people. Today, the avalanche of data produced and collected by smartphones, smart locks, cameras, routers, social media, smart watches, etc. makes filtering for the relevant information harder than finding a needle in a haystack, even for teams with large budgets, if they rely on human labor alone. What is worse, the opposition in a case will often overproduce, copying and delivering irrelevant data to “bankrupt” the other side during production requests. With the amount of discovery data growing exponentially and many cases being won or lost on who does better e-discovery, what can a legal team practically do?

The need to develop a concise, provable virtual timeline is what is driving the adoption of Artificial Intelligence (AI) in e-discovery. The key advantages of AI are the ability to distill terabytes of data, make relevant connections between important pieces of data, and create a virtual timeline in a fraction of the time and effort required by human review supported only by non-AI e-discovery software.

Finding relevant data in a discovery dump is the first task an e-discovery team faces. This is extremely difficult due to the wide variety of formats the information comes in: text, photos, video, audio, log files, paper files, etc.

For text-based information, traditional non-AI software searches for keywords in documents. This simplistic approach produces many false positives and false negatives, which a human then has to review and sort through manually. AI approaches allow teams to classify documents automatically based on a much smaller set of documents (typically called “training data”) that have been manually marked as relevant or not. This can reduce the manual work from reading through 100,000 documents to looking at only around 100.

What’s more, with Natural Language Processing (NLP), a subset of AI, e-discovery teams can browse through adjacent concepts in a document production, which is important because what to search for is not always known in advance. For example, in a car accident case, an attorney might want to find information about issues that could have caused the accident, such as medical or mechanical problems. Manually searching through terabytes of documents with guessed keywords is tedious and expensive, and is unlikely to find the actual cause. With NLP, an attorney can automatically surface the possible causes (e.g. prior DUIs) that the documents actually connect to that car accident.

AI can also take large volumes of text and automatically place the information in a tabular spreadsheet format under the proper column headings. This can be very useful for establishing a virtual timeline quickly and automatically. For example, take the sentences “This morning I ate breakfast” and “I didn’t sleep until 2am”. If the spreadsheet has two columns, “time” and “action”, the AI would automatically produce tabular data like this:

Time       Action
Morning    Ate breakfast
2am        Sleep
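To make the idea of turning sentences into “time” and “action” columns concrete, here is a minimal sketch. It is not how an AI product does it: it swaps in a hand-written regular expression for the trained language model, and the list of time expressions and the function name extract_row are assumptions made for illustration. Its output is also rougher than the table above, since it does not normalize “I didn’t sleep until” down to “Sleep”.

```python
import re

# Hypothetical, hand-written list of time expressions. A real system would
# rely on a trained NLP model instead of this pattern.
TIME_PATTERN = re.compile(
    r"\b(\d{1,2}(?::\d{2})?\s?(?:am|pm)|this morning|this afternoon|tonight|yesterday)\b",
    re.IGNORECASE,
)

def extract_row(sentence):
    """Return a (time, action) pair, or None if no time expression is found."""
    match = TIME_PATTERN.search(sentence)
    if match is None:
        return None
    time = match.group(1)
    # Whatever remains after removing the time expression is kept as the action.
    action = sentence[:match.start()] + sentence[match.end():]
    action = re.sub(r"\s+", " ", action).strip(" .,")
    return time, action

sentences = ["This morning I ate breakfast", "I didn't sleep until 2am"]
rows = [row for row in map(extract_row, sentences) if row is not None]

for time, action in rows:
    print(f"{time:<14}{action}")
```

The gap between this output and the clean table is exactly the part that a language model is expected to fill: recognizing that differently worded sentences describe the same kind of event.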

This can work even when the sentences are structured very differently from one another. Another useful capability AI brings is sentiment analysis. As an example, this technology can be used to quickly surface all communication around a topic that was negative in tone. With AI, uncovering truths becomes faster and easier than ever, even with large volumes of documents.
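Both the relevance classification described earlier and sentiment filtering rest on the same idea: a model learns from a small set of hand-labeled examples and then labels the rest. The sketch below shows that idea with a small multinomial naive Bayes classifier written in plain Python. The class name, the labels and the six training sentences are all made up for illustration, and a real review platform would use far more training data and a stronger model.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.labels = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {label: Counter() for label in self.labels}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.labels:
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Start from the log prior, then add a log likelihood per known word.
            score = math.log(self.doc_counts[label] / total_docs)
            for word in tokenize(text):
                if word in self.vocab:
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Made-up training data: a handful of documents labeled by a human reviewer.
training = [
    ("brake failure reported before the crash", "relevant"),
    ("driver had a prior DUI arrest", "relevant"),
    ("mechanic warned about worn brake pads", "relevant"),
    ("lunch menu for the office party", "not_relevant"),
    ("quarterly parking permit renewal notice", "not_relevant"),
    ("reminder to submit vacation requests", "not_relevant"),
]
texts, labels = zip(*training)
model = NaiveBayes().fit(texts, labels)

print(model.predict("the mechanic said the brake was worn"))
print(model.predict("office lunch menu and parking notice"))
```

Swapping the labels for “negative” and “neutral” turns the same sketch into a crude sentiment filter, which is why a small labeled set can serve several review tasks.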

On the other hand, when it comes time to produce documents, non-relevant Personally Identifiable Information (PII) sometimes needs to be redacted ahead of time. Removing it manually is error-prone and expensive. Fortunately, with AI this can be done automatically and reliably with just a few clicks, even for PII that does not match a single search string (e.g. social security numbers formatted in different ways). In addition, responses to discovery requests sometimes have to be generated. To bankrupt the other side, AI text generators can be used to create realistic and legally correct but voluminous text answers that force the opposition to spend attorney time without the producing side spending its own. AI has created a spy vs. spy arms race that is forcing everyone to use AI for e-discovery.
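The social security number example can be illustrated with a short sketch. It uses a regular expression rather than AI, so it only covers the formats it was written for; the pattern, the function name redact_ssns and the placeholder text are assumptions for illustration. A bare nine-digit number is also redacted here, which will produce false positives on other nine-digit identifiers.

```python
import re

# Matches SSN-like numbers written as 123-45-6789, 123 45 6789, 123.45.6789
# or 123456789. The backreference \1 requires the same separator both times,
# and the lookarounds stop the pattern from matching inside longer digit runs.
SSN_PATTERN = re.compile(r"(?<!\d)\d{3}([-. ]?)\d{2}\1\d{4}(?!\d)")

def redact_ssns(text, placeholder="[REDACTED-SSN]"):
    """Replace every SSN-like number in the text with a placeholder."""
    return SSN_PATTERN.sub(placeholder, text)

sample = "SSN 123-45-6789 and 987 65 4321, also 111223333. Call 555-123-4567."
print(redact_ssns(sample))
```

Phone numbers such as 555-123-4567 are left alone because their digit grouping differs. PII with no fixed shape at all, such as names or addresses, is where a trained model is needed instead of a pattern.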

With all the security cameras, traffic-light cameras and especially smartphone cameras, a visual footprint is created for everyone nearly all the time. This has produced tens of thousands of hours of photos and video that e-discovery teams have had to sift through by actually looking at the content. To solve this problem, automatic face recognition, scene recognition (e.g. finding a certain person when they are on a ski slope) and text recognition (e.g. license plates) have become essential tools for finding relevant video and photos in minutes. This automated visual search can also be done for custom images like tattoos, jewelry, clothing, etc., even with bad angles, poor quality or partial images. In addition, objects and people can be tracked in video, which can be useful for finding video proof that a neighbor poisoned a tree without having to watch weeks’ worth of Nest Cam video. AI has made it possible to find the proverbial needle in the haystack in a fraction of the time and expense of human review.

Social media has proven to be essential evidence in court cases. This is true even for people who are not on social media. For instance, while a person may not have an account, their friends might, and those friends might post photos or videos of them or mention them in a caption. However, manually browsing through hundreds of social media accounts to find relevant information is very expensive. Fortunately, using some of the AI technologies mentioned above, this search can be done automatically and much more accurately. For instance, computer vision can be used to find a person in a photo, and a program can then be instructed to run the same visual search for that person across everyone else’s social media accounts. This can be used to find a photo proving that someone went snowboarding while claiming a disability for insurance.
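One common way to run this kind of visual search is to convert each face into a numeric vector (an embedding) and compare vectors by similarity. The sketch below shows only the comparison step. The three-number vectors, account names and the 0.9 threshold are made up for illustration; real embeddings come from a face-recognition model and have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def find_matches(query, photos, threshold=0.9):
    """Return (account, photo_id, score) for photos similar to the query, best first."""
    hits = []
    for account, photo_id, embedding in photos:
        score = cosine_similarity(query, embedding)
        if score >= threshold:
            hits.append((account, photo_id, round(score, 3)))
    return sorted(hits, key=lambda hit: hit[2], reverse=True)

# Made-up vectors standing in for the output of a face-recognition model.
reference_face = [0.9, 0.1, 0.4]
photos = [
    ("friend_a", "ski_trip_01", [0.88, 0.12, 0.41]),
    ("friend_b", "birthday_07", [0.1, 0.9, 0.2]),
    ("friend_c", "slopes_03", [0.8, 0.2, 0.5]),
]

matches = find_matches(reference_face, photos)
for account, photo_id, score in matches:
    print(account, photo_id, score)
```

The threshold is a trade-off: set lower, it surfaces more candidate photos for a human to confirm; set higher, it misses more but wastes less reviewer time.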

The Internet of Things (IoT) has brought a plethora of monitoring devices (e.g. smart door locks, routers, Apple tracker tags, Roku TVs) that people may not even realize are tracking them and their actions. The data from these IoT devices can be essential in a case. While the human-consumable data sources discussed above may seem voluminous, the data generated by IoT devices can be even larger and, worse, tends not to be easily comprehensible to the average person. However, IoT data can be fed into a machine learning model (a type of AI) to make predictions about what might have been happening, even if there is no eyewitness or video proof of the situation. An example might be a Nest thermostat turning on more frequently to cool an Airbnb unit because an illegal party was held there. While this may not be admissible evidence, it can allow a legal team to piece together a timeline and focus its efforts, for example on finding social media posts from the area around the party that might have trashed the Airbnb rental.
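The thermostat example can be sketched with a very simple statistical check standing in for a trained machine learning model: flag any day whose number of cooling cycles sits far above the average. The daily counts, the dates and the threshold of two standard deviations are made up for illustration.

```python
from statistics import mean, stdev

def flag_unusual_days(daily_counts, threshold=2.0):
    """Return days whose count is more than `threshold` standard deviations above the mean."""
    values = list(daily_counts.values())
    avg = mean(values)
    spread = stdev(values)
    if spread == 0:
        return []  # every day looks the same, so nothing stands out
    return [day for day, count in daily_counts.items()
            if (count - avg) / spread > threshold]

# Made-up data: number of cooling cycles a thermostat logged per day.
cooling_cycles = {
    "2023-07-01": 11,
    "2023-07-02": 12,
    "2023-07-03": 10,
    "2023-07-04": 13,
    "2023-07-05": 11,
    "2023-07-06": 12,
    "2023-07-07": 11,
    "2023-07-08": 30,
}

print(flag_unusual_days(cooling_cycles))
```

A flagged day is only a lead. It tells the team where to look next, not what happened, which matches the point above that this kind of signal focuses effort rather than serving as proof.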

Once AI completes the analysis and pieces all the relevant information together, it is important to present the virtual timeline in a form that a legal team can use to uncover truths. This could involve putting all the documents/photos/links on a private web server so that the team can browse and interactively search through documents.
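At its simplest, the virtual timeline is a single list of events from every source, ordered by time. The sketch below shows that merging step only. The Event fields, the source names and the sample events are hypothetical, and a real tool would add links back to the underlying documents and a search interface.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class Event:
    timestamp: datetime
    source: str
    description: str

def build_timeline(*sources):
    """Merge events from several sources into one list ordered by time."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda event: event.timestamp)

def render(timeline):
    """Format the timeline as plain text, one event per line."""
    return "\n".join(
        f"{e.timestamp:%Y-%m-%d %H:%M}  [{e.source}]  {e.description}"
        for e in timeline
    )

# Made-up events from three different data sources.
lock_events = [
    Event(datetime(2023, 7, 8, 21, 5), "smart lock", "Front door unlocked"),
    Event(datetime(2023, 7, 9, 2, 40), "smart lock", "Front door locked"),
]
thermostat_events = [
    Event(datetime(2023, 7, 8, 22, 0), "thermostat", "Cooling cycles spike"),
]
social_events = [
    Event(datetime(2023, 7, 8, 23, 15), "social media", "Party photo posted nearby"),
]

timeline = build_timeline(lock_events, thermostat_events, social_events)
print(render(timeline))
```

Keeping the source name on every event matters for a legal team, since each line of the timeline has to be traceable back to the evidence that supports it.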

AI is bringing a whole set of new capabilities to the e-discovery field. This will save a tremendous amount of time and effort. More importantly, AI can help surface truths that would have been impossible to find otherwise. In addition, the exponentially growing volume of data being generated is making the use of AI essential for every court case.